Ingesters latency and in-flight requests spike right after startup with empty TSDB
Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered to the ring and its state switches to ACTIVE, it suddenly receives a bunch of new series. If you target each ingester to have about 1.5M active series, it has to add 1.5M series to the TSDB within a few seconds.
Today, while scaling out a large number of ingesters (50), a few of them experienced very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory to increase until some of these ingesters were OOMKilled.
I've been able to profile the affected ingesters and the following is what I found so far.
1. The number of in-flight push requests skyrockets right after ingester startup

2. The number of TSDB appenders skyrockets too

3. The average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too

4. Lock contention in Head.getOrCreateWithID()
Unsurprisingly, looking at the active goroutines, 99.9% of them were blocked in Head.getOrCreateWithID() due to lock contention (see the sketch after this list).

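To make the contention easier to picture, here is a minimal Go sketch. It is not the actual Prometheus TSDB code, just a toy stand-in for the shape of `Head.getOrCreateWithID()`: on an empty head every incoming series misses the read-locked fast path and falls through to the exclusive write lock, so a startup burst of brand new series serializes all push goroutines on a single mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// toyHead is a toy stand-in for tsdb.Head: a map of series guarded by one RWMutex.
type toyHead struct {
	mtx    sync.RWMutex
	series map[string]struct{}
}

// getOrCreate mimics the shape of the series lookup: a read-locked fast path
// for existing series and a write-locked slow path for new ones.
func (h *toyHead) getOrCreate(labels string) {
	h.mtx.RLock()
	_, ok := h.series[labels]
	h.mtx.RUnlock()
	if ok {
		return // fast path: series already exists, read lock only
	}

	// Slow path: on an empty TSDB every single push lands here, so all
	// push goroutines contend on the same write lock.
	h.mtx.Lock()
	defer h.mtx.Unlock()
	if _, ok := h.series[labels]; !ok {
		h.series[labels] = struct{}{}
	}
}

func main() {
	h := &toyHead{series: map[string]struct{}{}}

	// Simulate the burst right after an empty ingester turns ACTIVE:
	// a flood of pushes where every series is brand new.
	var wg sync.WaitGroup
	for i := 0; i < 100000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			h.getOrCreate(fmt.Sprintf("series_%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println("created", len(h.series), "series")
}
```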
To Reproduce
Haven't found a way to easily reproduce it yet, either locally or with a stress test, but unfortunately it looks like it's not that difficult to reproduce in production (where debugging is harder).
Storage Engine
- [x] Blocks
- [ ] Chunks
This reminds me of #3097.
We now have -ingester.instance-limits.max-inflight-push-requests, which allows the requests to be capped and avoids the OOM, but it will still create a lot of noise from error messages and retried requests.
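For reference, this is roughly the pattern such a limit follows; it's a sketch of the general idea, not Cortex's actual implementation, and the `pushLimiter` type and names are hypothetical. In-flight pushes are counted and anything over the cap is rejected up front, which bounds memory but turns the overload into errors that clients will log and retry.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTooManyInflight = errors.New("too many in-flight push requests")

// pushLimiter caps the number of concurrent pushes; max would correspond to
// the value given to -ingester.instance-limits.max-inflight-push-requests.
type pushLimiter struct {
	inflight int64
	max      int64
}

// push rejects the request up front once the cap is reached, which bounds
// memory, but every rejection is an error the client will log and retry.
func (l *pushLimiter) push(doPush func() error) error {
	if atomic.AddInt64(&l.inflight, 1) > l.max {
		atomic.AddInt64(&l.inflight, -1)
		return errTooManyInflight
	}
	defer atomic.AddInt64(&l.inflight, -1)
	return doPush()
}

func main() {
	l := &pushLimiter{max: 2}

	accepted := func() error { fmt.Println("push accepted"); return nil }
	_ = l.push(accepted)
	_ = l.push(accepted)

	// Pretend the limit is already saturated by concurrent pushes.
	atomic.StoreInt64(&l.inflight, 2)
	if err := l.push(accepted); err != nil {
		fmt.Println("push rejected:", err)
	}
}
```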