Should Ruler -> Store Gateway context cancels count as errors?
Describe the bug
We have imposed some query limits for the ruler (max aggregated chunk size: 500Mb), which cause the Ruler to cancel its RPC calls to the Store Gateway once the chunk limit is hit. This is working great and protects against bad rules ("the query exceeded the aggregated chunks size limit (limit: 50000000 bytes)").
The problem is that this seems to manifest as an error on the Store Gateway side and counts towards its error rate, which can be noisy and hard to weed out from real errors.
I was wondering if there would be the possibility of either ignoring these sorts of errors or classifying them differently?
To Reproduce
Steps to reproduce the behavior:
- Start Mimir (2.5.0)
- Load some expensive rules that exceed a deliberately low limit (e.g. set the limit to 1kb)
- Observe that error rate for store gateway component increases and traces are exported with error information
Expected behavior
Ruler canceling the operation should not impact the error rate of the store gateway or should be classified (as a label) to allow for better breakdown.
Environment
- Infrastructure: k8s
- Deployment tool: N/A
Additional Context
👋 Hi! Thanks for reporting it. Looks like an error not correctly handled.
To better understand it, could you give me the query run by the "Error rate / component" in the screenshot, please?
@pracucci sure thing, this is the query generated by the dashboard Mimir / Object Store:
sum by(component) (rate(thanos_objstore_bucket_operation_failures_total{cluster=~"$cluster", namespace=~"$namespace"}[$__rate_interval])) / sum by(component) (rate(thanos_objstore_bucket_operations_total{cluster=~"$cluster", namespace=~"$namespace"}[$__rate_interval]))
I'm glad I've asked for the metric :) The metric is tracked in another project, specifically here: https://github.com/thanos-io/objstore/blob/main/objstore.go
However, the failures metric is increased only if context wasn't canceled. For example, see here: https://github.com/thanos-io/objstore/blob/e4d8ba6bc6f3bfe074ca8fe125a1bb17bee4d3fe/objstore.go#L479-L481
At this point, without reproducing the issue myself I'm not sure what specific error is received by the underlying object storage client, so that it gets tracked as a failure.
Interesting, does the error get bubbled up anywhere? I see in tracing that it also gets classified as an error.
i think @colega did some work to not classify cancelled contexts as failed queries in #3837, that's still not in any released version though
Right, but since the failure reported in this issue is tracked in thanos_objstore_bucket_operation_failures_total, it's apparently on a lower level (the object storage client).
apologies, i misread Callum's comment.