mimir
Cache non-transient error responses from the query-frontend
What this PR does
Create a new query-frontend middleware that caches errors returned by queries when those errors are non-transient, meaning the query would fail the same way if executed again. This saves work when a query hits, for example, a limit error: running the query again cannot succeed and only wastes resources.
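To make the idea concrete, here is a minimal sketch of such a middleware. It is not the code in this PR: `queryFunc`, `errorCache`, `wrap`, and `errLimit` are hypothetical names, the in-memory map stands in for whatever cache backend the real middleware would use, and the decision about which errors are non-transient is passed in as a predicate.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// queryFunc is a hypothetical stand-in for the next handler in the
// query-frontend middleware chain.
type queryFunc func(tenant, query string) (string, error)

// errLimit is a hypothetical non-transient error used for illustration.
var errLimit = errors.New("the query exceeded the maximum number of chunks")

// errorCache remembers errors per (tenant, query) pair. An in-memory map keeps
// the sketch self-contained; a shared cache with a TTL is assumed in practice.
type errorCache struct {
	mu      sync.Mutex
	entries map[string]error
}

func newErrorCache() *errorCache {
	return &errorCache{entries: map[string]error{}}
}

// wrap returns a queryFunc that replays a cached error when one exists and
// stores new errors for which cacheable reports true.
func (c *errorCache) wrap(next queryFunc, cacheable func(error) bool) queryFunc {
	return func(tenant, query string) (string, error) {
		key := tenant + "\x00" + query

		c.mu.Lock()
		cached, ok := c.entries[key]
		c.mu.Unlock()
		if ok {
			return "", cached // skip the work: the query would fail again
		}

		res, err := next(tenant, query)
		if err != nil && cacheable(err) {
			c.mu.Lock()
			c.entries[key] = err
			c.mu.Unlock()
		}
		return res, err
	}
}

func main() {
	calls := 0
	next := func(tenant, query string) (string, error) {
		calls++ // counts how often the query is actually executed
		return "", errLimit
	}
	run := newErrorCache().wrap(next, func(err error) bool {
		return errors.Is(err, errLimit)
	})

	_, err1 := run("tenant-a", "up")
	_, err2 := run("tenant-a", "up") // served from the error cache
	fmt.Println(err1, err2, calls)
}
```

Keying on tenant plus query is an assumption made for the sketch; the actual request and error combination that should be cached is one of the open questions for review.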
Which issue(s) this PR fixes or relates to
Fixes #2676
Fixes #7340
Notes for reviewers
This is a draft because it's not ready for a complete review:
- There are no tests at the moment
- Configuration is incomplete
- Instrumentation (metrics, traces) is incomplete
What I'm looking for while in draft:
- Are the error types being cached appropriate?
- Is the logic for determining if an error and request combination should be cached appropriate?
- What sort of instrumentation should be added to help understand this feature in production?
Checklist
- [ ] Tests updated.
- [ ] Documentation added.
- [ ] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- [ ] `about-versioning.md` updated with experimental features.
Below are some examples of errors that would be cached, pulled from the logs of an internal cluster over the last seven days. Each error falls into one of two categories: bad queries or limits. These are the types of errors this feature is intended to cache.
Examples of errorType = "execution" that result in HTTP 422:
| Error | Category |
| --- | --- |
| many-to-many matching not allowed: matching labels must be unique on one side. | Bad query |
| multiple matches for labels: many-to-one matching must be explicit (group_left/group_right). | Bad query |
| multiple matches for labels: grouping labels must ensure unique matches. | Bad query |
| expanding series: invalid destination label name in label_join():. | Bad query |
| expanding series: invalid source label name in label_join(): (.*). | Bad query |
| Can't query aggregated metric X without aggregation because.... | Bad query |
| query processing would load too many samples into memory in query execution. | Limit |
| expanding series: the query exceeded the maximum number of chunks (limit: N chunks) (err-mimir-max-chunks-per-query). | Limit |
| expanding series: the query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-query-length). | Limit |
Examples of errorType = "bad_data" that result in HTTP 400:
| Error | Category |
| --- | --- |
| invalid parameter "query": 1:135: parse error: ... | Bad query |
| invalid parameter "query": invalid expression type "range vector" for range query, must be Scalar or instant Vector. | Bad query |
| the total query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-total-query-length). | Limit |
| exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution. | Limit |
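One possible way to express the classification above is a lookup keyed on `errorType`. This is only a sketch under stated assumptions: `errorClass`, `classes`, and `shouldCache` are hypothetical names, not identifiers from this PR, and treating every error type not listed above (for example `timeout`) as transient is an assumption rather than a decision made here.

```go
package main

import (
	"fmt"
	"net/http"
)

// errorClass describes how one errorType is treated in this sketch.
type errorClass struct {
	status    int  // HTTP status the error type results in
	cacheable bool // whether the error is assumed to be non-transient
}

// classes covers only the two error types shown in the examples above.
var classes = map[string]errorClass{
	"execution": {status: http.StatusUnprocessableEntity, cacheable: true}, // HTTP 422
	"bad_data":  {status: http.StatusBadRequest, cacheable: true},          // HTTP 400
}

// shouldCache reports whether errors of the given errorType would be cached.
// Types missing from classes fall back to the zero value and are not cached.
func shouldCache(errorType string) bool {
	return classes[errorType].cacheable
}

func main() {
	for _, t := range []string{"execution", "bad_data", "timeout"} {
		fmt.Printf("%s cache=%t\n", t, shouldCache(t))
	}
}
```

Defaulting unknown types to "not cached" keeps the failure mode conservative: a missed caching opportunity costs repeated work, while caching a transient error would keep returning a failure for a query that could now succeed.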