cortex Add proposal for tenant limits API

What this PR does:

This PR adds a proposal for a tenant limits API.

Which issue(s) this PR fixes: Fixes #

Checklist

[ ] Tests updated
[x] Documentation added
[ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Jun 14 '25 00:06 bogdan-st

Hello, this has been bothering me and my team for a while. I would love to work on it, though I don't know much about what implementing it would take and will probably need some guidance.

Jun 14 '25 00:06 bogdan-st

I think we need some validation in this API to check whether a limit update is within a safe range.

Also, not all limits should be modifiable through this API. Limits like shard_size should only be modified by an admin.
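
To make the second point concrete, here is a minimal sketch of an allow-list check. It is not taken from the proposal: the `userConfigurable` set and the `rejectedLimits` helper are hypothetical names, and which limits belong on the list is an assumption for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// userConfigurable is a hypothetical allow-list of limits a tenant may change
// through the API. Anything not listed (for example shard_size) is admin-only.
var userConfigurable = map[string]bool{
	"ingestion_rate":             true,
	"ingestion_burst_size":       true,
	"max_global_series_per_user": true,
}

// rejectedLimits returns, in sorted order, the requested limit names that are
// not on the allow-list. An empty result means the request may proceed.
func rejectedLimits(requested map[string]float64) []string {
	var rejected []string
	for name := range requested {
		if !userConfigurable[name] {
			rejected = append(rejected, name)
		}
	}
	sort.Strings(rejected)
	return rejected
}

func main() {
	req := map[string]float64{"ingestion_rate": 500000, "shard_size": 8}
	if bad := rejectedLimits(req); len(bad) > 0 {
		fmt.Println("admin-only limits in request:", bad)
	}
}
```

Rejecting the whole request when any admin-only limit is present keeps updates all-or-nothing, which seems easier to reason about than partially applying them.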

Jun 26 '25 18:06 harry671003

I agree that many limits should only be modified by an admin. We use the tiers defined in the cortex-jsonnet repo, and I wrote this proposal with those in mind; there have been few occasions when users needed limit increases other than those.

Regarding unreasonable limits, I thought a bit about this and, at least in our use case, defining "reasonable" would be pretty hard. It might be perfectly reasonable for a huge user to double their data overnight and not very reasonable for a small one to do the same thing.
Are you thinking about dynamically changing the upper limit based on their current usage? I'd like that, but I don't really know the implications. Setting one hard limit "in the middle" for everyone might encourage every small user to max it out, because why not, while big users won't get any benefit. I guess a per-tenant "limit of limits" might be the right answer here, as I have seen that work in other systems: it would turn 5 limit-increase requests into 1-2 hard-limit increases. Do you have any other suggestions? I'd like to add this to the doc, as I think it is a pretty important matter.

Jun 26 '25 21:06 bogdan-st

Call it hard overrides.

For example:
# file: runtime.yaml
# In this example, we're overriding ingestion limits for a single tenant.
overrides:
  "user1":
    ingestion_burst_size: 350000
    ingestion_rate: 350000
    max_global_series_per_metric: 300000
    max_global_series_per_user: 300000
    max_series_per_metric: 0
    max_series_per_user: 0
    max_samples_per_query: 100000
    max_series_per_query: 100000
configurable-overrides:
  "user1":
    ingestion_rate: 700000
    max_global_series_per_user: 700000

configurable-overrides or hard-overrides: I don't know which name communicates the situation better. Everything defined in configurable-overrides can be modified in the overrides, up to the value given there.
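
A rough sketch of how the API could enforce this, assuming the configurable-overrides block for a tenant has already been loaded into a map. The `validateOverride` function is a hypothetical helper, not existing Cortex code, and rejecting values below 1 is an assumption made because some limits may treat 0 as "disabled".

```go
package main

import "fmt"

// validateOverride checks one requested value against the tenant's hard
// limits. hard maps a limit name to its ceiling, mirroring the
// configurable-overrides block in the example above.
func validateOverride(hard map[string]float64, name string, value float64) error {
	ceiling, ok := hard[name]
	if !ok {
		// No ceiling defined: the limit is not tenant-configurable.
		return fmt.Errorf("limit %q is not configurable for this tenant", name)
	}
	// Assumption: values below 1 are rejected so a tenant cannot switch a
	// limit off by sending 0.
	if value < 1 || value > ceiling {
		return fmt.Errorf("limit %q: %v is outside [1, %v]", name, value, ceiling)
	}
	return nil
}

func main() {
	// Ceilings for "user1", copied from the configurable-overrides example.
	hard := map[string]float64{
		"ingestion_rate":             700000,
		"max_global_series_per_user": 700000,
	}
	fmt.Println(validateOverride(hard, "ingestion_rate", 500000))       // within the ceiling
	fmt.Println(validateOverride(hard, "ingestion_rate", 900000))       // above the ceiling
	fmt.Println(validateOverride(hard, "max_series_per_query", 200000)) // no ceiling defined
}
```

With this shape, a limit missing from configurable-overrides is implicitly admin-only, so the same block doubles as the allow-list.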

Jun 26 '25 22:06 friedrichg

What about defining a quota unit (the default values) and keeping the "hard limit" as an integer for how many quota units a user can reach? I think changing the limits individually is also useful, but since they usually scale as a group I would love this to support quota units as well, both as hard limits for what can be configured and as a way to batch-increase overrides using the API.
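
To illustrate what I mean, a small sketch under stated assumptions: the unit values below are made up, and `scaleQuota` is a hypothetical helper. Each limit is the per-unit value multiplied by the number of units, and the tenant's hard limit is simply a maximum number of units.

```go
package main

import "fmt"

// scaleQuota returns the limits for a tenant holding `units` quota units.
// unit holds the per-unit value of each limit; maxUnits is the tenant's hard
// limit expressed in units.
func scaleQuota(unit map[string]int, units, maxUnits int) (map[string]int, error) {
	if units < 1 || units > maxUnits {
		return nil, fmt.Errorf("units %d is outside [1, %d]", units, maxUnits)
	}
	limits := make(map[string]int, len(unit))
	for name, perUnit := range unit {
		limits[name] = perUnit * units
	}
	return limits, nil
}

func main() {
	// Hypothetical per-unit values; real ones would come from the defaults.
	unit := map[string]int{
		"ingestion_rate":             25000,
		"max_global_series_per_user": 150000,
	}
	limits, err := scaleQuota(unit, 4, 10)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(limits)
}
```

This only covers numeric limits that scale together; anything that is not a plain number would still have to be set individually.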

Jun 27 '25 12:06 bogdan-st

> What about defining a quota unit (the default values) and keeping the "hard limit" as an integer for how many quota units a user can reach? I think changing the limits individually is also useful but since they usually scale as a group I would love this to support these quota units as well, as hard limits for what can be configured but also as a way to batch increase overrides using the api.

I believe you mean that increasing the quota would increase several limits at once. But I think there is a misunderstanding: overrides are more than just limits. They also include configuration such as DisabledRuleGroups and OutOfOrderTimeWindow.

Jun 27 '25 18:06 friedrichg

Suggestions addressed. Thank you for the help!

Jul 04 '25 21:07 bogdan-st