feat(endpoint): Make `/push` endpoint safe to retry (idempotent).
Signed-off-by: Kaviraj [email protected]
What this PR does / why we need it:
This PR makes the /push endpoint idempotent and thus safe to retry.
Problem
We started seeing duplicate log entries with nanosecond-precision timestamps on our cells.
Cause of the problem
Say our setup runs with a replication factor of 3.
A single client request received by a distributor is sent to three ingesters, and because of quorum we return success to the client if at least 2 of them succeed. Otherwise we ask the client to retry.
Consider the case where the push failed on 2 ingesters (A and B) and succeeded on 1 ingester (C), meaning that one ingester already has these log lines in its block. Now the client retries, and if ingester C has received more requests in the meantime, making its last-line timestamp different, it can accept the old lines again during the retry, creating duplicate entries.
Also NOTE: this retry can happen for many other reasons, for example when an ingester sent a success response but the client never received it (because it was lost due to a network timeout between the client and the distributor).
Solution
Make the /push endpoint idempotent and thus safe for clients to retry:
- The promtail client now sends a unique idempotency key per batch push (a single logproto.PushRequest).
- The distributor forwards the idempotency key to all the downstream ingesters (based on the replication factor).
- Every ingester checks, based on the idempotency key, whether the log lines were already added to its block. If they were, the push is a no-op for that ingester (a minimal sketch of this check follows the list).
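For illustration only, here is a minimal sketch of the ingester-side check from the last bullet, assuming a hypothetical stream type that remembers just the last push ID it appended. It is a simplification for clarity, not the actual Loki implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// stream remembers the push ID of the last batch it appended, so a retried
// batch carrying the same ID can be treated as a no-op.
type stream struct {
	mtx        sync.Mutex
	lastPushID string
	entries    []string
}

// push appends the batch unless it carries the same non-empty pushID as the
// previous successful append, i.e. the client is retrying a batch that this
// ingester already applied.
func (s *stream) push(pushID string, lines []string) int {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	// An empty pushID means an older client that does not send the key yet;
	// accept it unconditionally (see the rollout note below).
	if pushID != "" && pushID == s.lastPushID {
		return 0 // duplicate retry: no-op
	}
	s.entries = append(s.entries, lines...)
	s.lastPushID = pushID
	return len(lines)
}

func main() {
	s := &stream{}
	fmt.Println(s.push("batch-1", []string{"a", "b"})) // 2: first delivery
	fmt.Println(s.push("batch-1", []string{"a", "b"})) // 0: retry is a no-op
	fmt.Println(s.push("batch-2", []string{"c"}))      // 1: new batch
}
```

Because a retry reuses the same pushID, the second call above is recognized as a duplicate and dropped instead of being appended again.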
Rolling out
We cannot discard client requests that don't carry an idempotency key, because during the promtail rollout older versions of promtail will keep sending batches without one.
So we have taken into account that batches without an idempotency key are still accepted by the ingesters during the rollout. We can make this strict once the new version of promtail is deployed everywhere.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Checklist
- [ ] Documentation added
- [x] Tests updated
- [ ] Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
- [ ] Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0.3%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
If we merge this PR, we may not need the workaround @slim-bean proposed here: https://github.com/grafana/loki/pull/6642
This looks awesome!
What do you think about renaming IdempotentKey to just ID or RequestID? IdempotentKey is really specific to what we're accomplishing. If it were renamed, we would be communicating what it is rather than what it's for.
+1
What about PushID, which is a bit less generic than RequestID?
Another small suggestion: I think it would be a good idea to also log the key in Promtail in case of a retry. That would allow us to correlate logs from Promtail and the ingesters.
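As a rough illustration of that suggestion, a hedged sketch of logging the push ID when a Promtail-like client retries; the batch type, send function, and retry loop are hypothetical, not Promtail's actual code:

```go
package main

import (
	"fmt"
	"log"
)

type batch struct {
	pushID string
	lines  []string
}

// send pretends to push the batch and fails the first attempt to force a retry.
func send(b batch, attempt int) error {
	if attempt == 0 {
		return fmt.Errorf("temporary network error")
	}
	return nil
}

func main() {
	b := batch{pushID: "0c6d2f9e-example", lines: []string{"line 1", "line 2"}}

	const maxRetries = 3
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err := send(b, attempt); err != nil {
			// Logging the push ID here lets operators correlate this retry
			// with the ingester logs that record the same ID.
			log.Printf("retrying push, pushID=%s attempt=%d err=%v", b.pushID, attempt+1, err)
			continue
		}
		break
	}
}
```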
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
Nice, this is awesome timing. Would you update the HTTP API as well? Did you consider generating a key by hashing the log event in case it's not supplied?
Would you update the HTTP API as well?
@jeschkies the payload for the HTTP push API is logproto.PushRequest; I added pushID (the idempotency key) directly there, making the API safe to retry. Is there anything else you mean here?
Did you consider generating a key by hashing the log event in case it's not supplied?
On the promtail side, yes, we auto-generate a UUID. On the loki side, no, because during the rollout there will be two versions of promtail (a new one that passes the idempotency key and an older one that doesn't), and loki has to accept requests from both.
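For illustration, a minimal sketch of generating a unique push ID per batch on the client side; the use of github.com/google/uuid and the batch type are assumptions for the example, not Promtail's actual types:

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

type batch struct {
	pushID string
	lines  []string
}

// newBatch assigns a fresh UUID so every distinct batch gets its own push ID.
// Retries of the same batch reuse the same ID, which is what makes a retried
// push a no-op on the ingester side.
func newBatch(lines []string) batch {
	return batch{
		pushID: uuid.New().String(),
		lines:  lines,
	}
}

func main() {
	b := newBatch([]string{"first line", "second line"})
	fmt.Println("pushID:", b.pushID, "lines:", len(b.lines))
}
```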
@chaudum @MasslessParticle thanks for the suggestion, that totally makes sense. Renamed idempotentKey -> pushID.
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
Seems like we abandoned this. I'm going to close it for now. Please reopen if this is in error!