feat(endpoint): Make `/push` endpoint safe to retry (idempotent).
Signed-off-by: Kaviraj [email protected]
What this PR does / why we need it:
This PR makes the /push endpoint idempotent and thus safe to retry.
Problem
We started seeing duplicate log entries with nanosecond-precision timestamps on our cells.
Cause of the problem
Say our setup runs with a replication factor of 3.
A single client request received by a distributor is sent to three ingesters, and because of quorum we return success to the client if at least 2 of them succeed. Otherwise we ask the client to retry.
Consider the case where the push failed on 2 ingesters (A and B) and succeeded on 1 ingester (C), meaning that one ingester already has these log lines in its block. Now the client retries, and if ingester C has received more requests in the meantime, making its last-line timestamp different, it can accept the old lines again during the retry, creating duplicate entries.
Also NOTE: this retry can happen for many other reasons, for example when an ingester sent a success response but the client never received it (because it was lost due to a network timeout between the client and the distributor).
Solution
Make the /push endpoint idempotent and thus safe for clients to retry:
- The promtail client now sends a unique idempotency key per batch push (a single logproto.PushRequest).
- The distributor forwards the idempotency key to all the downstream ingesters (based on the replication factor).
- Every ingester checks, based on the idempotency key, whether the log lines were already added to its block. If they were, the push is a no-op for that ingester (a minimal sketch of this check follows the list).
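For illustration only, here is a minimal sketch of the ingester-side check from the last bullet, assuming a hypothetical stream type that remembers just the last push ID it appended. It is a simplification for clarity, not the actual Loki implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// stream remembers the push ID of the last batch it appended, so a retried
// batch carrying the same ID can be treated as a no-op.
type stream struct {
	mtx        sync.Mutex
	lastPushID string
	entries    []string
}

// push appends the batch unless it carries the same non-empty pushID as the
// previous successful append, i.e. the client is retrying a batch that this
// ingester already applied.
func (s *stream) push(pushID string, lines []string) int {
	s.mtx.Lock()
	defer s.mtx.Unlock()

	// An empty pushID means an older client that does not send the key yet;
	// accept it unconditionally (see the rollout note below).
	if pushID != "" && pushID == s.lastPushID {
		return 0 // duplicate retry: no-op
	}
	s.entries = append(s.entries, lines...)
	s.lastPushID = pushID
	return len(lines)
}

func main() {
	s := &stream{}
	fmt.Println(s.push("batch-1", []string{"a", "b"})) // 2: first delivery
	fmt.Println(s.push("batch-1", []string{"a", "b"})) // 0: retry is a no-op
	fmt.Println(s.push("batch-2", []string{"c"}))      // 1: new batch
}
```

Because a retry reuses the same pushID, the second call above is recognized as a duplicate and dropped instead of being appended again.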
Rolling out
We cannot discard client requests that don't carry an idempotency key, because during the promtail rollout older versions of promtail will keep sending batches without one.
So we have taken into account that batches without an idempotency key are still accepted by the ingesters during the rollout. We can make this strict once the new version of promtail is deployed everywhere.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Checklist
- [ ] Documentation added
- [x] Tests updated
- [ ] Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
- [ ] Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0.3%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
If we merge this PR, we may not need the workaround @slim-bean proposed here: https://github.com/grafana/loki/pull/6642
This looks awesome!
What do you think about renaming IdempotentKey to just ID or RequestID? IdempotentKey is really specific to what we're accomplishing. If it were renamed, we would be communicating what it is rather than what it's for.
+1
What about PushID, which is a bit less generic than RequestID?
Another small suggestion: I think it would be a good idea to also log the key in Promtail in case of a retry. That would allow us to correlate logs from Promtail and the ingesters.
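As a rough illustration of that suggestion, a hedged sketch of logging the push ID when a Promtail-like client retries; the batch type, send function, and retry loop are hypothetical, not Promtail's actual code:

```go
package main

import (
	"fmt"
	"log"
)

type batch struct {
	pushID string
	lines  []string
}

// send pretends to push the batch and fails the first attempt to force a retry.
func send(b batch, attempt int) error {
	if attempt == 0 {
		return fmt.Errorf("temporary network error")
	}
	return nil
}

func main() {
	b := batch{pushID: "0c6d2f9e-example", lines: []string{"line 1", "line 2"}}

	const maxRetries = 3
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err := send(b, attempt); err != nil {
			// Logging the push ID here lets operators correlate this retry
			// with the ingester logs that record the same ID.
			log.Printf("retrying push, pushID=%s attempt=%d err=%v", b.pushID, attempt+1, err)
			continue
		}
		break
	}
}
```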
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
Nice, this is awesome timing. Would you update the HTTP API as well? Did you consider generating a key by hashing the log event in case it's not supplied?
Would you update the HTTP API as well?
@jeschkies the payload for the HTTP push API is logproto.PushRequest; I added pushID (the idempotency key) directly there, making the API safe to retry. Is there anything else you mean here?
Did you consider generating a key by hashing the log event in case it's not supplied?
On the promtail side, yes, we auto-generate a UUID. On the loki side, no, because during the rollout there will be two versions of promtail (a new one that passes the idempotency key and an older one that doesn't), and loki has to accept requests from both.
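For illustration, a minimal sketch of generating a unique push ID per batch on the client side; the use of github.com/google/uuid and the batch type are assumptions for the example, not Promtail's actual types:

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

type batch struct {
	pushID string
	lines  []string
}

// newBatch assigns a fresh UUID so every distinct batch gets its own push ID.
// Retries of the same batch reuse the same ID, which is what makes a retried
// push a no-op on the ingester side.
func newBatch(lines []string) batch {
	return batch{
		pushID: uuid.New().String(),
		lines:  lines,
	}
}

func main() {
	b := newBatch([]string{"first line", "second line"})
	fmt.Println("pushID:", b.pushID, "lines:", len(b.lines))
}
```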
@chaudum @MasslessParticle thanks for the suggestion, that totally makes sense. Renamed idempotentKey -> pushID.
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
Seems like we abandoned this. I'm going to close it for now. Please reopen if this is in error!