cortex query rejection

What this PR does: This PR introduces a new functionality to reject queries based on certain attributes to protect our services from resource-exhausting queries that can lead to service disruptions, such as OOM kills or availability drop for other queries. This change provides a mechanism to block heavy queries before they reach our services, ensuring operational stability and resource efficiency.

Changes Introduced

Query Rejection Mechanism: Rejects queries based on query api_type, regex, time window, time range, step size, user-agent regex, dashboard UID, and panel ID. Configured through runtime config, allowing per-tenant management.
Query Priority Mechanism: Query attributes from the existing query_priority functionality were utilized and extended to support the new properties.
Documentation Generator Fix: Fixed doc-generator to remove duplicate documentation on repeated fields on slices.

Usage details Both query priority and query rejection mechanisms utilize the same config object (query_attribute) to determine if a query matches specified criteria. The query_attribute config defines requirements for a query and results in a match or no match based on it's properties. The query_attribute does not directly reject or prioritize queries; rather, the action (query rejection or priority) is determined by the config where it was used.

All provided properties in query_attribute use an AND operation to decide if a query matches. Matching is only done on fields provided in the config.

Example config:

    query_rejection:
      enabled: true
      query_attributes:
        - regex: .*ALERT.*
          query_step_limit:
            min: 6s
            max: 20s
          dashboard_uid: "dash123"

Example query: curl 'localhost:8005/prometheus/api/v1/query?query=someCustomALERTquery&time=1718383304&end=1718386904&step=7s' -H "User-Agent: other" -H "X-Dashboard-Uid: dash123"

In this example, queries containing the string "ALERT" with a step between 6s and 20s, and a dashboard UID of "dash123" will be rejected. If any criteria specified in query_attribute properties are not met, such as a step of 30s, the query will not be rejected. In the above example, undefined properties (user_agent, time_window, etc.) are ignored, and the query is not checked against them.

Which issue(s) this PR fixes: Fixes #

Checklist

[x] Tests updated
[x] Documentation added
[x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Jun 11 '24 03:06 erlan-z

@erlan-z This is pretty cool. Thank you for doing this!

Things that I would love to see:

use warnExperimentalUse https://github.com/cortexproject/cortex/blob/6dd64fcce93e87497f0e3f6594e74f030b4538b4/pkg/util/log/experimental.go#L18

Document the experimental feature on docs/configuration/v1-guarantees.md

Integration tests

Documented experimental feature on docs/configuration/v1-guarantees.md
Added integration tests.
I didn't use warnExperimentalUse because this feature is enabled or disabled per tenant, not for the entire system. Additionally, it can change at runtime, so I decided not to add the experimental log or metric. Please let me know if you still think we should add it.

Jun 25 '24 22:06 erlan-z

For series/label/label_values, at least one match should match the regex.

I wonder why that's the case. How useful it is to match only one?

As we just go through that matchers and send separate query for each of them, if there is heavy matcher, then whole query will fail. I think if we want to reject/prioritize that heavy queries that match regex, than matching single one should be enough.

Jun 26 '24 18:06 erlan-z

As we just go through that matchers and send separate query for each of them, if there is heavy matcher, then whole query will fail. I think if we want to reject/prioritize that heavy queries that match regex, than matching single one should be enough.

Let's say we know a very heavy matcher {cluster="us-west-2"}. Do we reject the query if users have this matcher in the request? I think it is only expensive if they use this matcher along but should be fine as long as they add more matchers.

Jun 26 '24 18:06 yeya24

I think this LGTM but im afraid of making the cortex config even more complex. I wonder if those kinds of limits would not live better in another component like the auth gateway.

Said that I'm ok with the change but i think we should at least revisit all the other query limits that we already have on cortex and try to unify/simplify them here - so we have a one stop place to define query limits. This can be done in a follow up PR.

Ex of current query limits:

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]


# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]

Thoughts @friedrichg @CharlieTLe ?

Jul 02 '24 18:07 alanprot

I think this LGTM but im afraid of making the cortex config even more complex. I wonder if those kinds of limits would not live better in another component like the auth gateway.

Said that I'm ok with the change but i think we should at least revisit all the other query limits that we already have on cortex and try to unify/simplify them here - so we have a one stop place to define query limits. This can be done in a follow up PR.

Ex of current query limits:
# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]

# Maximum duration into the future you can query. 0 to disable.
# CLI flag: -querier.max-query-into-future
[max_query_into_future: <duration> | default = 10m]


# Limit the query time range (end - start time of range query parameter and max
# - min of data fetched time range). This limit is enforced in the
# query-frontend and ruler (on the received query). 0 to disable.
# CLI flag: -store.max-query-length
[max_query_length: <duration> | default = 0s]
Thoughts @friedrichg @CharlieTLe ?

Thanks for the review. Regarding incorporating all those limits, I initially envisioned query_rejection as a mechanism to protect our services from heavy queries, to be used by operators during events. The other mentioned configurations are intended to set predefined limits on queries. However, there seems to be some overlap, and having these configurations spread across different locations complicates things. Technically, they are all limits on queries and can be combined and managed under query_rejection. I'm unsure which approach is best: differentiating these concepts by having configurations in separate places or unifying them under query_rejection as a single location for managing all query-related limits.

The main argument I had against placing it in the auth gateway is that the auth gateway is primarily for authentication and shouldn't be aware of the request content. It shouldn't be parsing the query and used to limit queries based on their characteristics.

Jul 02 '24 19:07 erlan-z

Ok.. Ship it!

I think I would still try to deprecate the overlapping flags in a following PR

Jul 03 '24 21:07 alanprot