receive: memory spikes during tenant WAL truncation
Thanos, Prometheus and Golang version used:
thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
build user: root@61db75277a55
build date: 20240219-17:13:48
go version: go1.21.7
platform: linux/amd64
tags: netgo
Object Storage Provider: GCS
What happened: We have several Prometheus instances remote-writing to a set of 30 receivers. The receivers normally hover around 8GiB of memory, but once every 2 hours memory spikes across all receivers at the same time by roughly 20-25%.
The corresponding tenant WAL truncations happen across all receivers at the same times.
There are other memory spikes whose root cause I'm not certain of, such as the ones at 6:30 and 9:07. But looking at receiver memory usage over the past 2 weeks, there are consistent spikes whenever tenant WAL truncations happen.
What you expected to happen: No memory spikes during WAL truncation, or the ability to stagger when truncation happens.
How to reproduce it (as minimally and precisely as possible): Unsure; I'm running a fairly standard remote-write + receiver setup. I've raised this in the CNCF Slack and at least one other person has observed the same memory spikes.
Full logs to relevant components:
Anything else we need to know:
This seems to coincide with intervals when head compaction happens. I think this process acquires a write lock, so pending samples pile up in memory while it runs. @yeya24 do you see something similar in Cortex?
Just confirming the compactions happen at the same time as the memory spikes
Did you get context deadline exceeded (500) errors from ingesters during the WAL compaction?
Yeah, this optimization is something that needs to be done on the Prometheus side :/ I think this is the hot path: https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L1543-L1554
Some improvements that could be made IMHO: https://github.com/prometheus/prometheus/pull/13642 https://github.com/prometheus/prometheus/pull/13632
Cortex and Mimir solve this by adding jitter between compactions for different tenants. We can disable automatic compaction in the TSDB and manage it ourselves.
IDK if it's applicable to Thanos as well, but we recently added jitter by AZ so that only one AZ is performing head compaction at a time; since we replicate the data in quorum, overall latency is not affected. https://github.com/cortexproject/cortex/pull/5928
We are also seeing OOMs when a Thanos Receive pod restarts, because of a surge in the number of goroutines and in go_sync_mutex_wait_total_seconds_total.
We took a profile and found it stuck acquiring the lock for creating new series. We suppose that after a Thanos Receive restart it needs to rebuild the reverse index from scratch, since all data has been flushed, causing lock contention.
We think https://github.com/prometheus/prometheus/pull/13642 would fix the issue, and we should accelerate getting it merged into Prometheus.