receive: memory spikes during tenant WAL truncation
Thanos, Prometheus and Golang version used:
thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
build user: root@61db75277a55
build date: 20240219-17:13:48
go version: go1.21.7
platform: linux/amd64
tags: netgo
Object Storage Provider: GCS
What happened: We have several Prometheus instances remote-writing to a set of 30 receivers. The receivers normally hover around 8GiB of memory, but once every 2 hours memory spikes across all receivers at the same time by roughly 20-25%.
The corresponding tenant WAL truncations happen across all receivers at the same times.
There are other memory spikes whose root cause I'm not certain of, such as the ones at 6:30 and 9:07. But looking at receiver memory usage over the past 2 weeks, there are consistent spikes whenever tenant WAL truncations happen.
What you expected to happen: No memory spikes during WAL truncation, or the ability to stagger when truncation happens.
How to reproduce it (as minimally and precisely as possible): Unsure; I'm running a fairly standard remote-write + receiver setup. I've raised this in the CNCF Slack and at least one other person has observed the same memory spikes.
Full logs to relevant components:
Anything else we need to know:
This seems to coincide with intervals when head compaction happens. I think this process acquires a write lock, so pending samples pile up in memory while it runs. @yeya24 do you see something similar in Cortex?
Just confirming the compactions happen at the same time as the memory spikes
Did you get context deadline exceeded (500) errors from ingesters during the WAL compaction?
Yeah, this optimization is something that needs to be done on the Prometheus side :/ I think this is the hot path: https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L1543-L1554
Some improvements that could be made IMHO: https://github.com/prometheus/prometheus/pull/13642 https://github.com/prometheus/prometheus/pull/13632
Cortex and Mimir solve this by adding jitter between compactions for different tenants. We can disable automatic compaction in the TSDB and manage it ourselves.
IDK if it's applicable to Thanos as well, but we recently added jitter by AZ so that only one AZ is performing head compaction at a time; since we replicate the data in quorum, overall latency is not affected. https://github.com/cortexproject/cortex/pull/5928
We are also seeing OOMs when a Thanos Receive pod restarts, because of a surge in the number of goroutines and in go_sync_mutex_wait_total_seconds_total.
We took a profile and found it stuck acquiring the lock for creating new series. We suppose that after a Thanos Receive restart it needs to rebuild the reverse index from scratch, since all data has been flushed, causing lock contention.
We think https://github.com/prometheus/prometheus/pull/13642 would fix the issue, and we should accelerate getting it merged into Prometheus.