mimir icon indicating copy to clipboard operation
mimir copied to clipboard

Should Ruler -> Store Gateway context cancels count as errors?

Open callumj opened this issue 2 years ago • 7 comments

Describe the bug

We have imposed some query limits for the ruler (max aggregated chunk size: 500Mb) which will cause the Ruler to cancel the RPC calls to the Store Gateway once the chunk limits hit. This is working great and protects against bad rules. (the query exceeded the aggregated chunks size limit (limit: 50000000 bytes))

The problem is this seems to manifest as an error on the Store Gateway side and will count towards the error rate: Screenshot 2023-01-05 at 10 54 30 AM

Which can be noisy and hard to weed out from real errors.

I was wondering if there would be the possibility of either ignoring these sorts of errors or classifying them differently?

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir (2.5.0)
  2. Load in some expensive rules that exceed a particularly set limit (e.g. could set the limit to be 1kb)
  3. Observe that error rate for store gateway component increases and traces are exported with error information

Expected behavior

Ruler canceling the operation should not impact the error rate of the store gateway or should be classified (as a label) to allow for better breakdown.

Environment

  • Infrastructure: k8s
  • Deployment tool: N/A

Additional Context

callumj avatar Jan 05 '23 15:01 callumj

👋 Hi! Thanks for reporting it. Looks like an error not correctly handled.

To better understand it, could you give me the query run by the "Error rate / component" in the screenshot, please?

pracucci avatar Jan 09 '23 17:01 pracucci

@pracucci sure thing, this is the query generated by the dashboard Mimir / Object Store:

sum by(component) (rate(thanos_objstore_bucket_operation_failures_total{cluster=~"$cluster", namespace=~"$namespace"}[$__rate_interval])) / sum by(component) (rate(thanos_objstore_bucket_operations_total{cluster=~"$cluster", namespace=~"$namespace"}[$__rate_interval]))

callumj avatar Jan 09 '23 17:01 callumj

I'm glad I've asked for the metric :) The metric is tracked in another project, specifically here: https://github.com/thanos-io/objstore/blob/main/objstore.go

However, the failures metric is increased only if context wasn't canceled. For example, see here: https://github.com/thanos-io/objstore/blob/e4d8ba6bc6f3bfe074ca8fe125a1bb17bee4d3fe/objstore.go#L479-L481

At this point, without reproducing the issue myself I'm not sure what specific error is received by the underlying object storage client, so that it gets tracked as a failure.

pracucci avatar Jan 10 '23 15:01 pracucci

However, the failures metric is increased only if context wasn't canceled. For example, see here: https://github.com/thanos-io/objstore/blob/e4d8ba6bc6f3bfe074ca8fe125a1bb17bee4d3fe/objstore.go#L479-L481

At this point, without reproducing the issue myself I'm not sure what specific error is received by the underlying object storage client, so that it gets tracked as a failure.

Interesting, does the error get bubbled up anywhere? I see in tracing that it also gets classified as an error.

callumj avatar Jan 10 '23 16:01 callumj

i think @colega did some work to not classify cancelled contexts as failed queries in #3837, that's still not in any released version though

dimitarvdimitrov avatar Jan 10 '23 16:01 dimitarvdimitrov

i think @colega did some work to not classify cancelled contexts as failed queries in #3837, that's still not in any released version though

Right, but since the failure reported in this issue is tracked in thanos_objstore_bucket_operation_failures_total it's apparently on a lower level (the object storage client).

pracucci avatar Jan 10 '23 16:01 pracucci

apologies, i misread Callum's comment.

dimitarvdimitrov avatar Jan 10 '23 16:01 dimitarvdimitrov