loki icon indicating copy to clipboard operation
loki copied to clipboard

fix ingester: grpc method=/logproto.Querier/Query fail.

Open liguozhong opened this issue 2 years ago • 9 comments

What this PR does / why we need it: When ingester's ctx times out, don't try to use grpc to return data to the querier again. This PR fixes the ingester error rate > 0 for our loki cluster. We rely on this error rate indicator for alarming image

Which issue(s) this PR fixes: Fixes #5964

Special notes for your reviewer:

Checklist

  • [ ] Documentation added
  • [ ] Tests updated
  • [ ] Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
  • [ ] Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

liguozhong avatar Apr 20 '22 14:04 liguozhong

It's not something you added in this PR, but I wanted to note there is a simpler way to do isDone(): https://github.com/grafana/loki/blob/557e6e528059691911945bd7a26e8efded84cd08/pkg/ingester/instance.go#L683-L690

Just check ctx.Err() != nil, as in the docs:

	// If Done is not yet closed, Err returns nil.
	// If Done is closed, Err returns a non-nil error explaining why:

bboreham avatar Apr 21 '22 10:04 bboreham

I find it slightly weird that there are now four calls to isDone() in sendBatches(), three looking at ctx and one looking at queryServer.Context().

If there is a good reason for this, please put it in a comment.

Possibly both loops should check both contexts. Arguably both loops should only check at one point.

bboreham avatar Apr 21 '22 10:04 bboreham

It's not something you added in this PR, but I wanted to note there is a simpler way to do isDone():

https://github.com/grafana/loki/blob/557e6e528059691911945bd7a26e8efded84cd08/pkg/ingester/instance.go#L683-L690

Just check ctx.Err() != nil, as in the docs:

	// If Done is not yet closed, Err returns nil.
	// If Done is closed, Err returns a non-nil error explaining why:

thanks , done.

liguozhong avatar Apr 21 '22 13:04 liguozhong

I find it slightly weird that there are now four calls to isDone() in sendBatches(), three looking at ctx and one looking at queryServer.Context().

If there is a good reason for this, please put it in a comment.

Possibly both loops should check both contexts. Arguably both loops should only check at one point.

_, ctx := stats.NewContext(queryServer.Context())

I checked that ctx comes from queryServer.Context() , so I deleted queryServer.Context() and uniformly used ctx as the parameter of isDone() func.

liguozhong avatar Apr 21 '22 13:04 liguozhong

I like this commit. These messages have similarly annoyed us during monitoring.

splitice avatar May 20 '22 15:05 splitice

Do you mind rebasing it? Other than that, LGTM

DylanGuedes avatar Jun 23 '22 11:06 DylanGuedes

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Aug 13 '22 10:08 stale[bot]

Go away useless bot

splitice avatar Aug 13 '22 10:08 splitice

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Sep 21 '22 06:09 stale[bot]

@liguozhong Can you rebase? Once that is done this can be merged.

MichelHollands avatar Nov 07 '22 15:11 MichelHollands

@liguozhong Can you rebase? Once that is done this can be merged.

done,thanks

liguozhong avatar Nov 08 '22 07:11 liguozhong

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.1%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

grafanabot avatar Nov 08 '22 07:11 grafanabot

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.1%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

grafanabot avatar Nov 08 '22 09:11 grafanabot