
Add write limits for tenants in Thanos Receiver

douglascamata opened this issue 3 years ago • 11 comments

Is your proposal related to a problem?

Tenants can overload Thanos Receivers with remote write requests and bring the whole system down.

Describe the solution you'd like

I would like to put maximum limits on the size of remote write requests, so that a single tenant is less likely to negatively affect others.

These are the proposed "knobs" for limiting the remote write endpoint usage (please feel free to propose more and give your opinion):

  • Number of timeseries per request. Labels: tenant and receive instance.
  • Number of samples per request. Labels: tenant and receive instance.
  • Size in bytes of the incoming request body. Labels: tenant and receive instance.
  • Number of concurrent HTTP requests. Labels: receive instance.

Hitting the first two limits should trigger an HTTP response with status code 413 (entity too large), and the last one should trigger a 429 (too many requests). In the future, the 413 error can be recognized by remote write clients and the data split into smaller requests that fit under the limits.
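For illustration, here is a minimal Go sketch of how the request-size and concurrency checks could behave. The limit values, handler path, and names are assumptions, not the actual Thanos implementation (which would also decode the remote write payload to count timeseries and samples):

```go
package main

import "net/http"

const (
	maxBodyBytes          = 32 << 20 // e.g. a 32M body cap (assumed value)
	maxConcurrentRequests = 100      // per-instance cap (assumed value)
)

// sem is a counting semaphore bounding in-flight write requests.
var sem = make(chan struct{}, maxConcurrentRequests)

// limitMiddleware rejects oversized bodies with 413 and excess concurrency
// with 429, matching the status codes proposed above.
func limitMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // acquired a concurrency slot
			defer func() { <-sem }()
		default:
			http.Error(w, "too many requests", http.StatusTooManyRequests) // 429
			return
		}
		if r.ContentLength > maxBodyBytes {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge) // 413
			return
		}
		// Timeseries- and samples-per-request checks would run after decoding
		// the remote write payload, answering 413 in the same way.
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/api/v1/receive", limitMiddleware(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) },
	)))
	http.ListenAndServe(":8080", nil) // illustrative address
}
```

A channel used as a counting semaphore keeps the concurrency check lock-free and makes the 429 path non-blocking.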

I would also like to expose the current values and limits of each "knob" as metrics of Thanos Receive. This would allow easy tracking of the limit system.

To ensure backwards compatibility, limiting should be optional and disabled by default.

For the sake of simplicity and iteration, we should start with a global value for each knob (all tenants share the same upper bound) and later add the possibility of configuring different values per tenant.
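As a sketch of what that staged configuration could look like — all field and type names below are assumptions, not the final Thanos configuration format:

```go
// Hypothetical limits configuration, sketched as Go structs with YAML tags.
package config

// WriteLimits holds per-request upper bounds. A zero value means "unlimited",
// which keeps limiting disabled by default for backwards compatibility.
type WriteLimits struct {
	SeriesLimit    int   `yaml:"series_limit"`     // max timeseries per request
	SamplesLimit   int   `yaml:"samples_limit"`    // max samples per request
	SizeBytesLimit int64 `yaml:"size_bytes_limit"` // max request body size
}

// RootLimitsConfig starts with a single global default; the Tenants map is
// the later iteration that overrides it per tenant.
type RootLimitsConfig struct {
	Default WriteLimits            `yaml:"default"`
	Tenants map[string]WriteLimits `yaml:"tenants,omitempty"`
}
```

With a zero value meaning "no limit", simply omitting the configuration keeps the endpoint unrestricted, matching the backwards-compatibility requirement above.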

Describe alternatives you've considered

  • Ask tenants to be mindful of the amount of metrics they are sending
  • Ask tenants to be careful when writing their remote write configurations
  • Adding smarter/stateful rate limit knobs that could track limits across a ring of Receivers. Initial work on this is being started in #5333; meanwhile, simpler limits can be added and still be helpful.

Additional context

  • Similar abandoned proposal: #3963 (PR at #3964)
  • Lots of inspiration being taken from Cortex's limits

TODO

  • [x] Export the criteria for limiting the remote write requests as metrics of Thanos Receive (#5424).
  • [x] Add charts with these metrics in the Receive dashboard. (#5472)
  • [x] Add support for configuring & applying limits (single limit for all tenants, #5527).
  • [ ] In-progress: Add support for per-tenant limits. (#5565)
  • [ ] Add charts with these metrics in the Receive dashboard.

douglascamata avatar Jun 02 '22 08:06 douglascamata

FYI: pressed the button early accidentally. I am still writing.

douglascamata avatar Jun 02 '22 08:06 douglascamata

Ok, I think I am mostly ready with the writing at this moment.

douglascamata avatar Jun 02 '22 12:06 douglascamata

Also considering adding a limit on the maximum number of labels per timeseries.

douglascamata avatar Jun 02 '22 13:06 douglascamata

Well, it's not a good idea to record the number of labels per timeseries: cardinality would be very high. Keeping it out of the plans for now.

douglascamata avatar Jun 02 '22 14:06 douglascamata

I think we can use a bloom filter to track per-tenant cardinality.

hanjm avatar Jun 03 '22 05:06 hanjm

@hanjm, good idea! Thanks for the suggestion, I will investigate how we could use it.
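Something along these lines might be a starting point — a minimal sketch of approximate per-tenant series counting with a hand-rolled bloom filter, where the sizing, hashing scheme, and all names are assumptions rather than Thanos code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// seriesBloom approximates the number of distinct series seen for one tenant.
type seriesBloom struct {
	bits  []uint64
	m     uint64 // number of bits in the filter
	k     int    // number of hash functions
	count uint64 // approximate distinct-series count
}

func newSeriesBloom(m uint64, k int) *seriesBloom {
	return &seriesBloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// positions derives k bit positions from the series labels via double hashing.
func (b *seriesBloom) positions(series string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(series))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31 // cheap second hash derived from the first
	pos := make([]uint64, b.k)
	for i := range pos {
		pos[i] = (h1 + uint64(i)*h2) % b.m
	}
	return pos
}

// observe records a series; the counter is bumped only when at least one bit
// was unset, i.e. the series is (probably) new for this tenant.
func (b *seriesBloom) observe(series string) {
	isNew := false
	for _, p := range b.positions(series) {
		word, bit := p/64, p%64
		if b.bits[word]&(1<<bit) == 0 {
			b.bits[word] |= 1 << bit
			isNew = true
		}
	}
	if isNew {
		b.count++
	}
}

func main() {
	b := newSeriesBloom(1<<24, 4) // ~16M bits; sizing is an assumption
	b.observe(`up{job="demo"}`)
	b.observe(`up{job="demo"}`) // duplicate: count stays at 1
	fmt.Println(b.count)
}
```

The counter only ever undercounts (a false positive in the filter makes a new series look already-seen), which seems acceptable for limiting purposes.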

douglascamata avatar Jun 03 '22 08:06 douglascamata

Have you seen how Loki implements this as well? It might give some inspiration (not sure if it's relevant, though). For example, limiting on bytes/s might also be useful. Anyhow, sounds really great to have!

wiardvanrij avatar Jun 09 '22 11:06 wiardvanrij

From community discussion:

  • It makes sense, let's do this
  • Ideally disabled by default to not break compatibility
  • Provide best practices - what works

bwplotka avatar Jun 09 '22 11:06 bwplotka

@wiardvanrij thanks for the tip. I checked out their limits and I believe it makes total sense that we have a limit on request body size too.

I don't want to dive into rate limits (i.e. bytes per second or timeseries per second) in this proposal though, as these add the extra complication of continuously calculating the data. I believe they could be part of a different proposal and possibly take advantage of the outcome of #5415.

This proposal is more for limiting the size of remote write requests, which requires no state and is much easier to implement.

douglascamata avatar Jun 13 '22 10:06 douglascamata

The remote write request body size (in bytes) will be exported as a histogram, and I propose these buckets:

  • 1K, 32K, 256K, 512K, 1M, 16M, 32M.

Please let me know if you have other suggestions.

For the number of samples & timeseries per request, I would like to also use histograms. The buckets would be configurable, with a default provided. Which buckets do you think we could provide as the default? Possibly generate an exponential bucket set using prometheus.ExponentialBuckets?
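For reference, here is a sketch of both registrations with the Go Prometheus client. The metric names are assumptions; the byte buckets mirror the list above, and prometheus.ExponentialBuckets(start, factor, count) generates the exponential set:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Request body size histogram with the buckets proposed above:
	// 1K, 32K, 256K, 512K, 1M, 16M, 32M.
	writeRequestSizeBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_request_size_bytes", // assumed name
		Help:    "Size in bytes of incoming remote write request bodies.",
		Buckets: []float64{1 << 10, 32 << 10, 256 << 10, 512 << 10, 1 << 20, 16 << 20, 32 << 20},
	})

	// Timeseries-per-request histogram with an exponential default:
	// ExponentialBuckets(10, 5, 7) yields 10, 50, 250, ..., 156250.
	writeTimeseriesPerRequest = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "thanos_receive_write_timeseries", // assumed name
		Help:    "Number of timeseries per remote write request.",
		Buckets: prometheus.ExponentialBuckets(10, 5, 7),
	})
)

func main() {
	prometheus.MustRegister(writeRequestSizeBytes, writeTimeseriesPerRequest)
	writeRequestSizeBytes.Observe(64 << 10) // e.g. a 64K request body
}
```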

douglascamata avatar Jun 15 '22 13:06 douglascamata

FYI, the remote write request body size metric was added as a summary without any quantiles defined yet.
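For context, a Go-client summary declared without an Objectives map exposes only the _sum and _count series and no quantile series, so quantiles can be added later without breaking dashboards. A minimal sketch, with the metric name assumed:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// No Objectives map: the client then tracks only a running sum and count.
var writeRequestSize = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "thanos_receive_write_request_size_bytes", // assumed name
	Help: "Size in bytes of remote write request bodies.",
})

func main() {
	prometheus.MustRegister(writeRequestSize)
	writeRequestSize.Observe(2048)
}
```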

douglascamata avatar Jul 06 '22 13:07 douglascamata

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Nov 13 '22 15:11 stale[bot]

I'm on vacation, bot. Don't stale me!

douglascamata avatar Nov 13 '22 16:11 douglascamata

Everything planned in this issue has already been implemented. Closing it.

douglascamata avatar Apr 06 '23 16:04 douglascamata