
Stagger ingester compactions to reduce resource peaks

Open bboreham opened this issue 3 years ago • 5 comments

Prometheus, and hence Mimir, does a "head compaction" every 2 hours, which is very disk- and CPU-intensive. Each compaction takes only a minute or so for an ingester with 1-2 million timeseries. The data being compacted spans -3 hours to -1 hour; there is a one-hour grace period in case more data comes in.

So we don't have to run all compactions exactly on the hour: we could start 2-3 minutes early or late without any impact. Doing this would spread out the peak load, so we can run just as effectively with lower provisioning.

We shouldn't stagger blocks for the same tenant too far apart, since we want the compactor to see all three replicas before it starts work. This suggests the stagger could be derived from a hash of the tenant ID.

bboreham · Aug 11 '22 13:08

Why only spread them out by 2-3 minutes instead of doing something like hash(<tenant>) % 120 and then offset the compaction time by the resulting number of minutes?

Note: I did % 120 because we want the blocks to get compacted once every 2h.

This way the compaction times would be completely different from tenant to tenant, but each ingester would still compact the blocks of a given tenant at the same time.

I'm guessing that this would also spread out the compactor load better.

replay · Aug 11 '22 23:08

> Note: I did % 120 because we want the blocks to get compacted once every 2h.

Once the TSDB head compaction threshold is reached, the longer you delay the compaction, the longer stale series stay in the head (this has negative side effects on both memory utilization and our limits).

pracucci · Aug 12 '22 06:08

Nit: I said to spread them 2-3 minutes early or late, so the total spread is 6 minutes. There is already some spreading in the current behaviour, since we run at most 10 compactions concurrently; I guess it would be worth analysing in detail.

The earlier you push it, the less you can accommodate late samples. With out-of-order support that behaviour changes; maybe it's OK to go much earlier.

And going later delays garbage-collection of stale series, as Marco said.

bboreham · Aug 15 '22 14:08

> Once the TSDB head compaction threshold is reached, the longer you delay the compaction, the longer stale series stay in the head (this has negative side effects on both memory utilization and our limits).

Could we move the timing of the head compaction by the same offset as well? Basically, what I'm trying to say is: why not shift the timing of everything that happens once every 2h by the same offset of hash(<tenant>) % 120?

replay · Aug 15 '22 21:08

> Could we move the timing of the head compaction by the same offset as well? Basically, what I'm trying to say is: why not shift the timing of everything that happens once every 2h by the same offset of hash(<tenant>) % 120?

I think it would make reasoning about the system even more complex, and I wouldn't add that extra complexity. Right now it's easy to reason about blocks: for each tenant, every block time range is 2h long and aligned.

pracucci · Aug 16 '22 06:08

After some internal discussion at Grafana Labs, I think we could achieve something similar just by setting -blocks-storage.tsdb.head-compaction-concurrency=1 (the default is 5). I would give that a try before working on a more elaborate solution.
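For reference, that flag is set on the ingester; the flag name and default come from the comment above, while the `-target=ingester` invocation shown here is just an assumed deployment shape, not a prescribed one.

```shell
# Serialize head compactions within each ingester so at most one
# TSDB head compacts at a time (the default concurrency is 5).
mimir -target=ingester \
  -blocks-storage.tsdb.head-compaction-concurrency=1
```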

pracucci · Aug 25 '22 14:08