
Add write limits for tenants in Thanos Receiver

douglascamata opened this issue 3 years ago • 11 comments

Is your proposal related to a problem?

Tenants can overload Thanos Receivers with remote write requests and bring the whole system down.

Describe the solution you'd like

I would like to put maximum limits on the size of remote write requests, so that a single tenant is less likely to negatively affect others.

These are the proposed "knobs" for limiting the remote write endpoint usage (please feel free to propose more and give your opinion):

  • Number of timeseries per request. Labels: tenant and receive instance.
  • Number of samples per request. Labels: tenant and receive instance.
  • Size in bytes of the incoming request body. Labels: tenant and receive instance.
  • Number of concurrent HTTP requests. Labels: receive instance.

Hitting the first two limits should trigger an HTTP response with status code 413 (entity too large), and the last one should trigger a 429 (too many requests). In the future, the 413 error can be recognized by remote write clients and the data split into smaller requests that fit under the limits.
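For illustration, here is a minimal Go sketch of how the request-size and concurrency checks could behave. The limit values, handler path, and names are assumptions, not the actual Thanos implementation (which would also decode the remote write payload to count timeseries and samples):

```go
package main

import "net/http"

const (
	maxBodyBytes          = 32 << 20 // e.g. a 32M body cap (assumed value)
	maxConcurrentRequests = 100      // per-instance cap (assumed value)
)

// sem is a counting semaphore bounding in-flight write requests.
var sem = make(chan struct{}, maxConcurrentRequests)

// limitMiddleware rejects oversized bodies with 413 and excess concurrency
// with 429, matching the status codes proposed above.
func limitMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // acquired a concurrency slot
			defer func() { <-sem }()
		default:
			http.Error(w, "too many requests", http.StatusTooManyRequests) // 429
			return
		}
		if r.ContentLength > maxBodyBytes {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge) // 413
			return
		}
		// Timeseries- and samples-per-request checks would run after decoding
		// the remote write payload, answering 413 in the same way.
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/api/v1/receive", limitMiddleware(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) },
	)))
	http.ListenAndServe(":8080", nil) // illustrative address
}
```

A channel used as a counting semaphore keeps the concurrency check lock-free and makes the 429 path non-blocking.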

I would also like to expose the current values and limits of each "knob" as metrics of Thanos Receive. This would allow easy tracking of the limit system.

To ensure backwards compatibility, limiting should be optional and disabled by default.

For the sake of simplicity and iteration, we should start with a global value for each knob (all tenants share the same upper bound) and later add the possibility of configuring different values per tenant.
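As a sketch of what that staged configuration could look like — all field and type names below are assumptions, not the final Thanos configuration format:

```go
// Hypothetical limits configuration, sketched as Go structs with YAML tags.
package config

// WriteLimits holds per-request upper bounds. A zero value means "unlimited",
// which keeps limiting disabled by default for backwards compatibility.
type WriteLimits struct {
	SeriesLimit    int   `yaml:"series_limit"`     // max timeseries per request
	SamplesLimit   int   `yaml:"samples_limit"`    // max samples per request
	SizeBytesLimit int64 `yaml:"size_bytes_limit"` // max request body size
}

// RootLimitsConfig starts with a single global default; the Tenants map is
// the later iteration that overrides it per tenant.
type RootLimitsConfig struct {
	Default WriteLimits            `yaml:"default"`
	Tenants map[string]WriteLimits `yaml:"tenants,omitempty"`
}
```

With a zero value meaning "no limit", simply omitting the configuration keeps the endpoint unrestricted, matching the backwards-compatibility requirement above.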

Describe alternatives you've considered

  • Ask tenants to be mindful of the amount of metrics they are sending
  • Ask tenants to be careful when writing their remote write configurations
  • Adding smarter/stateful rate limit knobs that could track limits across a ring of Receivers. Initial work on this is being started in #5333; meanwhile, simpler limits can be added and still be helpful.

Additional context

  • Similar abandoned proposal: #3963 (PR at #3964)
  • Lots of inspiration being taken from Cortex's limits

TODO

  • [x] Export the criteria for limiting the remote write requests as metrics of Thanos Receive (#5424).
  • [x] Add charts with these metrics in the Receive dashboard. (#5472)
  • [x] Add support for configuring & applying limits (single limit for all tenants, #5527).
  • [ ] In-progress: Add support for per-tenant limits. (#5565)
  • [ ] Add charts with these metrics in the Receive dashboard.

douglascamata avatar Jun 02 '22 08:06 douglascamata

FYI: pressed the button early accidentally. I am still writing.

douglascamata avatar Jun 02 '22 08:06 douglascamata

Ok, I think I am mostly ready with the writing at this moment.

douglascamata avatar Jun 02 '22 12:06 douglascamata

Also considering adding a limit on the maximum number of labels per timeseries.

douglascamata avatar Jun 02 '22 13:06 douglascamata

Well, it's not a good idea to record the number of labels per timeseries: cardinality would be very high. Keeping it out of the plans for now.

douglascamata avatar Jun 02 '22 14:06 douglascamata

I think we can use a bloom filter to track per-tenant cardinality.

hanjm avatar Jun 03 '22 05:06 hanjm

@hanjm, good idea! Thanks for the suggestion, I will investigate how we could use it.
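Something along these lines might be a starting point — a minimal sketch of approximate per-tenant series counting with a hand-rolled bloom filter, where the sizing, hashing scheme, and all names are assumptions rather than Thanos code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// seriesBloom approximates the number of distinct series seen for one tenant.
type seriesBloom struct {
	bits  []uint64
	m     uint64 // number of bits in the filter
	k     int    // number of hash functions
	count uint64 // approximate distinct-series count
}

func newSeriesBloom(m uint64, k int) *seriesBloom {
	return &seriesBloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// positions derives k bit positions from the series labels via double hashing.
func (b *seriesBloom) positions(series string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(series))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31 // cheap second hash derived from the first
	pos := make([]uint64, b.k)
	for i := range pos {
		pos[i] = (h1 + uint64(i)*h2) % b.m
	}
	return pos
}

// observe records a series; the counter is bumped only when at least one bit
// was unset, i.e. the series is (probably) new for this tenant.
func (b *seriesBloom) observe(series string) {
	isNew := false
	for _, p := range b.positions(series) {
		word, bit := p/64, p%64
		if b.bits[word]&(1<<bit) == 0 {
			b.bits[word] |= 1 << bit
			isNew = true
		}
	}
	if isNew {
		b.count++
	}
}

func main() {
	b := newSeriesBloom(1<<24, 4) // ~16M bits; sizing is an assumption
	b.observe(`up{job="demo"}`)
	b.observe(`up{job="demo"}`) // duplicate: count stays at 1
	fmt.Println(b.count)
}
```

The counter only ever undercounts (a false positive in the filter makes a new series look already-seen), which seems acceptable for limiting purposes.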

douglascamata avatar Jun 03 '22 08:06 douglascamata

Have you seen how Loki implements this as well? It might give some inspiration (not sure if it's relevant, though). For example, limiting on bytes/s might also be useful. Anyhow, sounds really great to have!

wiardvanrij avatar Jun 09 '22 11:06 wiardvanrij

From community discussion:

  • It makes sense, let's do this
  • Ideally disabled by default to not break compatibility
  • Provide best practices - what works

bwplotka avatar Jun 09 '22 11:06 bwplotka

@wiardvanrij thanks for the tip. I checked out their limits and I believe it makes total sense that we have a limit on request body size too.

I don't want to dive into rate limits (i.e. bytes per second or timeseries per second) in this proposal though, as these add the extra complication of continuously calculating the data. I believe they could be part of a different proposal and possibly take advantage of the outcome of #5415.

This proposal is more for limiting the size of remote write requests, which requires no state and is much easier to implement.

douglascamata avatar Jun 13 '22 10:06 douglascamata

The remote write request body size (in bytes) will be exported as a histogram, and I propose these buckets:

  • 1K, 32K, 256K, 512K, 1M, 16M, 32M.

Please let me know if you have other suggestions.

For the number of samples & timeseries per request, I would like to also use histograms. The buckets would be configurable, with a default provided. Which buckets do you think we could provide as the default? Possibly generate an exponential bucket set using prometheus.ExponentialBuckets?
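For reference, here is a sketch of both registrations with the Go Prometheus client. The metric names are assumptions; the byte buckets mirror the list above, and prometheus.ExponentialBuckets(start, factor, count) generates the exponential set:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Request body size histogram with the buckets proposed above:
	// 1K, 32K, 256K, 512K, 1M, 16M, 32M.
	writeRequestSizeBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_request_size_bytes", // assumed name
		Help:    "Size in bytes of incoming remote write request bodies.",
		Buckets: []float64{1 << 10, 32 << 10, 256 << 10, 512 << 10, 1 << 20, 16 << 20, 32 << 20},
	})

	// Timeseries-per-request histogram with an exponential default:
	// ExponentialBuckets(10, 5, 7) yields 10, 50, 250, ..., 156250.
	writeTimeseriesPerRequest = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_timeseries", // assumed name
		Help:    "Number of timeseries per remote write request.",
		Buckets: prometheus.ExponentialBuckets(10, 5, 7),
	})
)

func main() {
	prometheus.MustRegister(writeRequestSizeBytes, writeTimeseriesPerRequest)
	writeRequestSizeBytes.Observe(64 << 10) // e.g. a 64K request body
}
```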

douglascamata avatar Jun 15 '22 13:06 douglascamata

FYI, the remote write request body size metric was added as a summary without any quantiles defined yet.
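For context, a Go-client summary declared without an Objectives map exposes only the _sum and _count series and no quantile series, so quantiles can be added later without breaking dashboards. A minimal sketch, with the metric name assumed:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// No Objectives map: the client then tracks only a running sum and count.
var writeRequestSize = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "thanos_receive_write_request_size_bytes", // assumed name
	Help: "Size in bytes of remote write request bodies.",
})

func main() {
	prometheus.MustRegister(writeRequestSize)
	writeRequestSize.Observe(2048)
}
```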

douglascamata avatar Jul 06 '22 13:07 douglascamata

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Nov 13 '22 15:11 stale[bot]

I'm on vacation, bot. Don't stale me!

douglascamata avatar Nov 13 '22 16:11 douglascamata

Everything planned in this issue has already been implemented. Closing it.

douglascamata avatar Apr 06 '23 16:04 douglascamata