thanos icon indicating copy to clipboard operation
thanos copied to clipboard

Receive: Allow remote write request limits to be defined per file and tenant

Open douglascamata opened this issue 1 year ago • 9 comments

  • [x] I added CHANGELOG entry for this change.
  • [ ] Change is not relevant to the end user.

Changes

This is part of the broader work outlined by #5404.

  • Allow request limits & gates to be configured using a file.
  • A user can save the file to disk and pass its path to receive.limits-config-file or pass the file content inline to receive.limits-config.
  • Documentation was updated with an example and explanation about how default and tenant limits interact.

Follow ups

  • Add a mechanism to reload the tenant limit configuration to avoid having to restart the Receive and reprocess the WAL.
    • I am already preparing a solution for this. It'll come in a separate PR soon after this is merged.
  • Eventually move the configuration for the active series limit added in #5520 to the same file.

Verification

  • Unit tests.
  • Ran it locally.

douglascamata avatar Aug 02 '22 16:08 douglascamata

xref: https://github.com/thanos-io/thanos/pull/5527#discussion_r927267408

squat avatar Aug 02 '22 16:08 squat

@yeya24 based on the conversation we had at https://github.com/thanos-io/thanos/pull/5527#discussion_r927267408, after I tried to add CLI args to configure the default limits in this PR while keeping also the file my opinion on this has changed for the following reasons:

  1. It adds complexity for users of Thanos to understand the configuration
    1. The logic becomes convoluted. There would be CLI args that override missing or present default values that override missing tenant values.
    2. It even allows for partial configuration of defaults via CLI and file at the same time, creating potential confusing setups for users (i.e. CLI defines default limit of series while config file defines default limit of request size).
  2. It adds complexity in the code without clear benefits
    1. Each one of these levels have different default values (zero or nil) that have to be handled.
    2. The type we use as a standard for passing configuration files (extflag.PathOrContent) also supports inlining of the configuration as a CLI argument. It's not difficult to use it to inline the configuration proposed in this PR in case you only want to set defaults.

I don't think adding extra CLI args is worth the price in extra complexity given the features and simplicity that we get from extflag.PathOrContent. What do you think?

douglascamata avatar Aug 03 '22 13:08 douglascamata

I found out during verification tests that the default values for limits are not being exported in the Receive's metrics. Will fix this.

edit: Fixed!

douglascamata avatar Aug 04 '22 14:08 douglascamata

Would changing the limits require receivers to be redeployed?

fpetkovski avatar Aug 09 '22 14:08 fpetkovski

@fpetkovski yes. We might want to improve this with some mechanism to reload this configuration in a follow up PR, like the config reload endpoint other components have.

Probably during reload time the limits would be shortly disabled to avoid synchronization issues under high load. WDYT?

douglascamata avatar Aug 09 '22 14:08 douglascamata

Yeah I think config reloading will be more than a nice to have for this feature. Large receiver deployments can take hours to rollout, so we should try to speed up configuration changes if we can.

fpetkovski avatar Aug 09 '22 15:08 fpetkovski

it still is tenant aware, even though it could be just "per label". I assume the conclusion is to use per tenant limits at the end?

I can refactor this to be purely label-aware and take a few more days (or weeks) to get it ready, tested, and documented while the feature doesn't land.

Or land the feature as experimental (very likely to change) and then work on extending and changing it to be just label-aware afterwards. I thought this was our conclusion from the community meeting. Wasn't it, @bwplotka?

douglascamata avatar Aug 11 '22 18:08 douglascamata

I can even already fill up the code, example configuration file, and logs with warnings that the configuration structure will be changing soon to work on a per-label basis.

douglascamata avatar Aug 11 '22 18:08 douglascamata

A point of the discussion: being straightforward that this feature supports tenancy is a plus for people operating Thanos clusters. Changing it to be label aware is friendly to maintainers/contributors (keep tenancy logic away and "hidden") but not to users (tenancy logic is hidden even though Receive is tenant-aware).

douglascamata avatar Aug 11 '22 18:08 douglascamata

There might be other points to consider before committing to make this purely label aware solution too:

  • Does it even make sense to apply a "request body size limit" to a subset of the request?
    • Only applicable limits are maximum amount of timeseries and samples.
  • Does it make sense performance-wise?
    • Regex matching or simple exact match?
    • What would the request response look like if only part of it hit a limit?
  • Complex logic vs simple per tenant limits
    • It becomes possible that two limit rules apply to given set of timeseries, which one would have priority?
    • What if there are multiple labels per group? i.e. labelA=valueA && labelB=valueB.
    • What if there are multiple values to group by, will regex performance be acceptable? i.e. labelA=valueA || labelA=valueB.
    • How to enable configuration if each individual value should be its own group? i.e. apply the limits for each unique value of labelA.

So I prefer to not rush directly into it.

douglascamata avatar Aug 16 '22 14:08 douglascamata

Maybe this can be merged now (once conflicts are resolved) while discussion is under heavy progress, as this feature is experimental and would likely go through some iterations to get prod-ready? 🙂

saswatamcode avatar Aug 29 '22 12:08 saswatamcode