Cache non-transient error responses from the query-frontend

Open 56quarters opened this issue 1 year ago • 1 comments

What this PR does

Create a new query-frontend middleware that caches errors returned by queries when those errors are non-transient, i.e. the query would fail the same way if executed again. This saves work when a query hits, e.g., a limit error: running the query again will not help and only wastes resources.
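To make the idea concrete, here is a minimal sketch of such a middleware. It is not the code in this PR: `Handler`, `NonTransientError`, and `ErrorCache` are hypothetical names, the cache is an in-memory map rather than a shared cache, and the key is just the query string. A real key would presumably also need the tenant and time range, and entries would need a TTL.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Handler is a hypothetical stand-in for the next handler in the
// query-frontend middleware chain.
type Handler func(query string) (string, error)

// NonTransientError is a hypothetical marker type for errors that would
// recur on retry, such as bad queries or limit violations.
type NonTransientError struct{ Msg string }

func (e *NonTransientError) Error() string { return e.Msg }

// ErrorCache is an in-memory stand-in for a shared cache of errors.
type ErrorCache struct {
	mu   sync.Mutex
	errs map[string]error
}

func NewErrorCache() *ErrorCache {
	return &ErrorCache{errs: make(map[string]error)}
}

// Wrap returns a handler that serves previously cached non-transient errors
// without calling next, and records new non-transient errors as they occur.
func (c *ErrorCache) Wrap(next Handler) Handler {
	return func(query string) (string, error) {
		c.mu.Lock()
		cached, ok := c.errs[query]
		c.mu.Unlock()
		if ok {
			return "", cached // skip the downstream work entirely
		}

		res, err := next(query)

		// Only cache errors marked as non-transient; successes and
		// transient failures always go through to next.
		var nt *NonTransientError
		if errors.As(err, &nt) {
			c.mu.Lock()
			c.errs[query] = err
			c.mu.Unlock()
		}
		return res, err
	}
}

func main() {
	calls := 0
	h := NewErrorCache().Wrap(func(q string) (string, error) {
		calls++
		return "", &NonTransientError{Msg: "the query exceeded the maximum number of chunks"}
	})

	_, _ = h("sum(foo)")
	_, err := h("sum(foo)") // served from the cache
	fmt.Println(err, "| downstream calls:", calls)
}
```

The second call returns the same error without reaching the downstream handler, which is the work this feature is meant to save.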

Which issue(s) this PR fixes or relates to

Fixes #2676

Fixes #7340

Notes for reviewers

This is a draft because it's not ready for a complete review:

  • There are no tests at the moment
  • Configuration is incomplete
  • Instrumentation (metrics, traces) is incomplete

What I'm looking for while in draft:

  • Are the error types being cached appropriate?
  • Is the logic for determining if an error and request combination should be cached appropriate?
  • What sort of instrumentation should be added to help understand this feature in production?

Checklist

  • [ ] Tests updated.
  • [ ] Documentation added.
  • [ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [ ] about-versioning.md updated with experimental features.

56quarters avatar Aug 16 '24 17:08 56quarters

Some examples of errors that would be cached, pulled from the logs of an internal cluster over the last seven days. Each error falls into one of two categories: bad queries or limits. These are the types of errors this feature is intended to cache.

Examples of errorType = "execution" that result in HTTP 422:

  • many-to-many matching not allowed: matching labels must be unique on one side. Bad query
  • multiple matches for labels: many-to-one matching must be explicit (group_left/group_right). Bad query
  • multiple matches for labels: grouping labels must ensure unique matches. Bad query
  • expanding series: invalid destination label name in label_join(): . Bad query
  • expanding series: invalid source label name in label_join(): (.*). Bad query
  • Can't query aggregated metric X without aggregation because.... Bad query
  • query processing would load too many samples into memory in query execution. Limit
  • expanding series: the query exceeded the maximum number of chunks (limit: N chunks) (err-mimir-max-chunks-per-query). Limit
  • expanding series: the query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-query-length). Limit

Examples of errorType = "bad_data" that result in HTTP 400:

  • invalid parameter \"query\": 1:135: parse error: .... Bad query
  • invalid parameter \"query\": invalid expression type \"range vector\" for range query, must be Scalar or instant Vector. Bad query
  • the total query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-total-query-length). Limit
  • exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution. Limit
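As a rough illustration of how the two error types above might be separated from transient ones, here is a sketch of a type check. The function name `isCacheableErrorType` is hypothetical, and treating "timeout" and "unavailable" as transient is an assumption made for the example. Whether every "execution" error is safe to cache is exactly the open question for reviewers, so a real check would likely need to look at more than the type string.

```go
package main

import "fmt"

// Error type strings as they appear in the examples above.
const (
	errTypeExecution = "execution" // returned with HTTP 422
	errTypeBadData   = "bad_data"  // returned with HTTP 400
)

// isCacheableErrorType reports whether an API error type belongs to one of
// the two types listed above. Every other type is treated as transient in
// this sketch, which is an assumption rather than the PR's actual logic.
func isCacheableErrorType(errorType string) bool {
	switch errorType {
	case errTypeExecution, errTypeBadData:
		return true
	default:
		return false
	}
}

func main() {
	for _, t := range []string{"execution", "bad_data", "timeout", "unavailable"} {
		fmt.Printf("%-12s cacheable=%v\n", t, isCacheableErrorType(t))
	}
}
```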

56quarters avatar Aug 16 '24 17:08 56quarters