Ingesters latency and in-flight requests spike right after startup with empty TSDB
Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered to the ring and its state switches to ACTIVE, it suddenly receives a bunch of new series. If you target each ingester to have about 1.5M active series, it has to add 1.5M series to the TSDB within a few seconds.
Today, while scaling out a large number of ingesters (50), a few of them experienced very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory to increase until some of these ingesters were OOMKilled.
I've been able to profile the affected ingesters and the following is what I found so far.
1. The number of in-flight push requests skyrockets right after ingester startup

2. The number of TSDB appenders skyrockets too

3. The average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too

4. Lock contention in Head.getOrCreateWithID()
Unsurprisingly, looking at the active goroutines, 99.9% of them were blocked in Head.getOrCreateWithID() due to lock contention (see the sketch after this list).

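To make the contention easier to picture, here is a minimal Go sketch. It is not the actual Prometheus TSDB code, just a toy stand-in for the shape of `Head.getOrCreateWithID()`: on an empty head every incoming series misses the read-locked fast path and falls through to the exclusive write lock, so a startup burst of brand new series serializes all push goroutines on a single mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// toyHead is a toy stand-in for tsdb.Head: a map of series guarded by one RWMutex.
type toyHead struct {
	mtx    sync.RWMutex
	series map[string]struct{}
}

// getOrCreate mimics the shape of the series lookup: a read-locked fast path
// for existing series and a write-locked slow path for new ones.
func (h *toyHead) getOrCreate(labels string) {
	h.mtx.RLock()
	_, ok := h.series[labels]
	h.mtx.RUnlock()
	if ok {
		return // fast path: series already exists, read lock only
	}

	// Slow path: on an empty TSDB every single push lands here, so all
	// push goroutines contend on the same write lock.
	h.mtx.Lock()
	defer h.mtx.Unlock()
	if _, ok := h.series[labels]; !ok {
		h.series[labels] = struct{}{}
	}
}

func main() {
	h := &toyHead{series: map[string]struct{}{}}

	// Simulate the burst right after an empty ingester turns ACTIVE:
	// a flood of pushes where every series is brand new.
	var wg sync.WaitGroup
	for i := 0; i < 100000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			h.getOrCreate(fmt.Sprintf("series_%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println("created", len(h.series), "series")
}
```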
To Reproduce
Haven't found a way to easily reproduce it yet, either locally or with a stress test, but unfortunately it looks like it's not that difficult to reproduce in production (where debugging is harder).
Storage Engine
- [x] Blocks
- [ ] Chunks
This reminds me of #3097.
We now have -ingester.instance-limits.max-inflight-push-requests, which allows the requests to be capped and avoids the OOM, but it will still create a lot of noise from error messages and retried requests.
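For reference, this is roughly the pattern such a limit follows; it's a sketch of the general idea, not Cortex's actual implementation, and the `pushLimiter` type and names are hypothetical. In-flight pushes are counted and anything over the cap is rejected up front, which bounds memory but turns the overload into errors that clients will log and retry.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTooManyInflight = errors.New("too many in-flight push requests")

// pushLimiter caps the number of concurrent pushes; max would correspond to
// the value given to -ingester.instance-limits.max-inflight-push-requests.
type pushLimiter struct {
	inflight int64
	max      int64
}

// push rejects the request up front once the cap is reached, which bounds
// memory, but every rejection is an error the client will log and retry.
func (l *pushLimiter) push(doPush func() error) error {
	if atomic.AddInt64(&l.inflight, 1) > l.max {
		atomic.AddInt64(&l.inflight, -1)
		return errTooManyInflight
	}
	defer atomic.AddInt64(&l.inflight, -1)
	return doPush()
}

func main() {
	l := &pushLimiter{max: 2}

	accepted := func() error { fmt.Println("push accepted"); return nil }
	_ = l.push(accepted)
	_ = l.push(accepted)

	// Pretend the limit is already saturated by concurrent pushes.
	atomic.StoreInt64(&l.inflight, 2)
	if err := l.push(accepted); err != nil {
		fmt.Println("push rejected:", err)
	}
}
```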