loki
loki copied to clipboard
fix ingester: grpc method=/logproto.Querier/Query fail.
What this PR does / why we need it:
When ingester's ctx times out, don't try to use grpc to return data to the querier again.
This PR fixes the ingester error rate > 0 for our loki cluster.
We rely on this error rate indicator for alarming
Which issue(s) this PR fixes: Fixes #5964
Special notes for your reviewer:
Checklist
- [ ] Documentation added
- [ ] Tests updated
- [ ] Is this an important fix or new feature? Add an entry in the
CHANGELOG.md
. - [ ] Changes that require user attention or interaction to upgrade are documented in
docs/sources/upgrading/_index.md
It's not something you added in this PR, but I wanted to note there is a simpler way to do isDone()
: https://github.com/grafana/loki/blob/557e6e528059691911945bd7a26e8efded84cd08/pkg/ingester/instance.go#L683-L690
Just check ctx.Err() != nil
, as in the docs:
// If Done is not yet closed, Err returns nil.
// If Done is closed, Err returns a non-nil error explaining why:
I find it slightly weird that there are now four calls to isDone()
in sendBatches()
, three looking at ctx
and one looking at queryServer.Context()
.
If there is a good reason for this, please put it in a comment.
Possibly both loops should check both contexts. Arguably both loops should only check at one point.
It's not something you added in this PR, but I wanted to note there is a simpler way to do
isDone()
:https://github.com/grafana/loki/blob/557e6e528059691911945bd7a26e8efded84cd08/pkg/ingester/instance.go#L683-L690
Just check
ctx.Err() != nil
, as in the docs:// If Done is not yet closed, Err returns nil. // If Done is closed, Err returns a non-nil error explaining why:
thanks , done.
I find it slightly weird that there are now four calls to
isDone()
insendBatches()
, three looking atctx
and one looking atqueryServer.Context()
.If there is a good reason for this, please put it in a comment.
Possibly both loops should check both contexts. Arguably both loops should only check at one point.
_, ctx := stats.NewContext(queryServer.Context())
I checked that ctx comes from queryServer.Context() , so I deleted queryServer.Context() and uniformly used ctx as the parameter of isDone() func.
I like this commit. These messages have similarly annoyed us during monitoring.
Do you mind rebasing it? Other than that, LGTM
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale
label sorted by thumbs up.
We may also:
- Mark issues as
revivable
if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed). - Add a
keepalive
label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.
Go away useless bot
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale
label sorted by thumbs up.
We may also:
- Mark issues as
revivable
if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed). - Add a
keepalive
label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.
@liguozhong Can you rebase? Once that is done this can be merged.
@liguozhong Can you rebase? Once that is done this can be merged.
done,thanks
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
- ingester -0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki
Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
- ingester -0.1%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%