Allow mmap chunks during head compaction
I was running a load test with native histograms and I noticed a huge spike in memory during head compaction.
Upon some investigation I noticed that the reason this is happening is that we were not calling h.mmapHeadChunks during head compaction, even though calling it at this point seems safe - this call is not even guarded by any mutex.
We were not calling it simply because the call was placed after the mutex taken to call db.reloadBlocks() in the following loop:
https://github.com/prometheus/prometheus/blob/ee5bba07c003af88d233083cdd65ac6d064b5b82/tsdb/db.go#L1056-L1067
Which is locked here during head compaction.
This PR just creates a new goroutine to keep calling h.mmapHeadChunks so it is not blocked by that mutex.
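For readers who don't want to open the diff, the shape of the change is roughly this (a minimal sketch, not the exact code from the PR; the ticker interval, stop channel and field names are illustrative):

```go
// Sketch: run the periodic mmapHeadChunks call in its own goroutine so it
// keeps making progress while reloadBlocks()/head compaction hold the
// reload mutex. Field names (stopc, head) are illustrative.
go func(stop <-chan struct{}, head *Head) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// mmapHeadChunks takes no DB-level mutex, so it is safe to
			// call it without waiting for the compaction loop.
			head.mmapHeadChunks()
		}
	}
}(db.stopc, db.head)
```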
These are the results before and after the change (CPU and memory now vs. 5 days ago, without this change):
https://github.com/prometheus/prometheus/pull/10709#issuecomment-1129330804
Hi @prymitive thanks for pointing that out.
I can try to do it in the same goroutine, but for that we would need to use something like TryLock, which I wanted to avoid. WDYT?
The thing is, with native histograms this spike is way more notable for 2 main reasons:
- We cut head chunks way more often, as the head chunk reaches the 1KB limit quite fast if the histogram has multiple buckets (once every 2 min in our case with 300 buckets) - FloatHistograms even faster.
- The size of the WAL can become quite large - proportional to the number of buckets in the histogram samples - which takes a long time to truncate (and the lock is still held while the WAL is being truncated).
We can see below the number of chunks created vs. the number mmapped, before and after the change. Before the change we kept creating chunks but were not mmapping them during head compaction/WAL truncation, causing memory to spike at that time (in our case from 3GB to 10GB+, which is quite a lot). Before:
After:
Since the objection is to CPU usage, could you add a CPU chart too please?
Hi @bboreham thanks for looking.
There is already a graph in the description that shows CPU before and after... it's hard to say anything, as sometimes we see exactly the same values and sometimes just a bit more (hard to say)...
Maybe we could run prombench here?
Sorry the charts looked similar so I didn't realise one was CPU and one was memory.
Something is using a lot more than one core; is it because this is actually Cortex?
Yeah... this is Cortex and it is a sum over multiple instances (ingesters). Heap size is the average across the ingester instances.
The spikes in CPU and memory are exactly during head compaction (every 2 hours). Those graphs compare before and after I applied this change (one has an offset of 6 days, which was before the patch). In them we can see that the memory spike is gone after the patch and the CPU seems similar.
> We cut head chunks way more often, as the head chunk reaches the 1KB limit quite fast if the histogram has multiple buckets (once every 2 min in our case with 300 buckets) - FloatHistograms even faster.
It was pretty much an ad-hoc decision to stick with the 1kb limit for histogram chunks and merely require that at least 10 histograms are in each chunk. Have you played with increasing that limit? It would be interesting to explore the trade-offs.
> It was pretty much an ad-hoc decision to stick with the 1kb limit for histogram chunks and merely require that at least 10 histograms are in each chunk. Have you played with increasing that limit? It would be interesting to explore the trade-offs.
I can do this test, but I think what will happen is that the baseline heap size will increase, as we will keep 4KB instead of 1KB in memory per series before cutting a new head chunk and mmapping it.
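For a rough sense of the trade-off (illustrative numbers, not measurements from this load test): with 1,000,000 active histogram series, raising the open head-chunk limit from 1KB to 4KB could keep up to 1,000,000 × 3KB ≈ 3GB of extra chunk data resident before the chunks are cut and mmapped.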
OK, thanks for your thought about this.
Anyway, this concrete PR is about a different approach to the problem. @codesome you are most familiar with mmap'd chunks. What do you think about this PR?
@codesome 🥺 :)
In your comparison charts, the CPU is spiking higher. And as I understand, CPU is much more expensive than giving a little more memory.
Can you please check what is causing the extra CPU use?
The change in general looks innocent, but I would like to be cautious about any extra CPU cost.
> In your comparison charts, the CPU is spiking higher. And as I understand, CPU is much more expensive than giving a little more memory.
> Can you please check what is causing the extra CPU use?
> The change in general looks innocent, but I would like to be cautious about any extra CPU cost.
Thanks a lot for looking at this @codesome :D:D:D
The first 3 spikes are very similar, no? Indeed the last 2 are a bit higher (2-4%), but I think it's just some variance between the runs, TBH... I can try to rerun the tests to make sure.
> CPU is much more expensive than giving a little more memory.
This scenario might not be entirely realistic since all the series in this load test are native histograms. However, during head compaction, memory usage spikes dramatically from 3GB to 10GB, which seems excessive, doesn't it? On the other hand, CPU usage increased by only about 2% during 3 of the 6 head compactions. While CPU resources are generally more expensive, the difference in CPU usage appears minimal compared to the significant memory spike. And since memory isn’t compressible, this spike could easily lead to out-of-memory (OOM) errors during head compaction?
Is it possible to run prombench here to see if we can capture something there as well? IDK if we currently ingest native histograms there, but at least we may see whether this causes more CPU with normal samples.
/prombench main
⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️
Compared versions: PR-14610 and main
After successful deployment, the benchmarking results can be viewed at:
Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main
(comparing against main is best done right after rebasing your project on main, but since your timezone is very different to mine I'll kick it off and we can see how it looks after a few hours)
/prombench cancel
Benchmark cancel is in progress.
Something happened. This PR started to behave weirdly after some point. @alanprot could you investigate what happened? Before the anomaly, there is a slight increase in CPU with no impact on memory, probably because we do not have native histograms in prombench. In general I think we could try this out and revert if it causes a problem, after we have fixed the potential issue, i.e.:
CPU:
Memory:
Parca only has the profiles after 3:42? (after the problem happened? :( )
It seems that for some reason I don't see the (*DB).run goroutine in the new version at 3:42... but I wonder if I'm looking at the correct profile: Link
Parca is configured with one hour of retention.
Ok.. I did find the deadlock between these 2 goroutines:
(it happens during series GC, and maybe my tests did not have churn, so I did not see this happening)
1 runtime.gopark /usr/local/go/src/runtime/proc.go:402
runtime.goparkunlock /usr/local/go/src/runtime/proc.go:408 (inline)
runtime.semacquire1 /usr/local/go/src/runtime/sema.go:160
sync.runtime_SemacquireMutex /usr/local/go/src/runtime/sema.go:77
sync.(*Mutex).lockSlow /usr/local/go/src/sync/mutex.go:171
sync.(*Mutex).Lock /usr/local/go/src/sync/mutex.go:90 (inline)
github.com/prometheus/prometheus/tsdb.(*Head).mmapHeadChunks /go/src/github.com/prometheus/prometheus/tsdb/head.go:1758
github.com/prometheus/prometheus/tsdb.(*DB).run.func1 /go/src/github.com/prometheus/prometheus/tsdb/db.go:1055
1 runtime.gopark /usr/local/go/src/runtime/proc.go:402
runtime.goparkunlock /usr/local/go/src/runtime/proc.go:408 (inline)
runtime.semacquire1 /usr/local/go/src/runtime/sema.go:160
sync.runtime_SemacquireRWMutex /usr/local/go/src/runtime/sema.go:87
sync.(*RWMutex).Lock /usr/local/go/src/sync/rwmutex.go:151
github.com/prometheus/prometheus/tsdb.(*stripeSeries).gc.func1 /go/src/github.com/prometheus/prometheus/tsdb/head.go:1944
github.com/prometheus/prometheus/tsdb.(*stripeSeries).iterForDeletion /go/src/github.com/prometheus/prometheus/tsdb/head.go:1983
github.com/prometheus/prometheus/tsdb.(*stripeSeries).gc /go/src/github.com/prometheus/prometheus/tsdb/head.go:1955
github.com/prometheus/prometheus/tsdb.(*Head).gc /go/src/github.com/prometheus/prometheus/tsdb/head.go:1573
github.com/prometheus/prometheus/tsdb.(*Head).truncateSeriesAndChunkDiskMapper /go/src/github.com/prometheus/prometheus/tsdb/head.go:1372
github.com/prometheus/prometheus/tsdb.(*Head).truncateMemory /go/src/github.com/prometheus/prometheus/tsdb/head.go:1170
github.com/prometheus/prometheus/tsdb.(*DB).compactHead /go/src/github.com/prometheus/prometheus/tsdb/db.go:1433
github.com/prometheus/prometheus/tsdb.(*DB).Compact /go/src/github.com/prometheus/prometheus/tsdb/db.go:1261
github.com/prometheus/prometheus/tsdb.(*DB).run.func2 /go/src/github.com/prometheus/prometheus/tsdb/db.go:1088
I will see how I can better fix this.
I think this deadlock can happen today as well in a very specific scenario: if autoCompact is disabled and we call db.Compact from another goroutine at the same time as mmapHeadChunks is called by the db.run method.
It does not happen often because as soon as we call db.Compact, the db.run goroutine gets stuck on the lock - but if we were very unlucky and called db.Compact at the exact moment db.run is on this line, it could still happen.
Ok.. I created a test to reproduce the problem. Here is a run with the deadlock happening and the test failing:
panic: test timed out after 10m0s
running tests:
TestMMapAndCompactConcurrently (9m25s)
goroutine 6291 [running]:
testing.(*M).startAlarm.func1()
/usr/local/go/src/testing/testing.go:2366 +0x385
The deadlock occurs because mmapHeadChunks and gc can acquire locks in a different order. mmapHeadChunks first locks a stripe in stripeSeries and then locks the series itself, while the gc method locks a stripe in the stripeSeries, then the series itself, and then potentially another stripe of the stripeSeries when deleting a time series. This happens because a time series can be present in two stripes - one because of its ID and another because of its hash.
To prevent the deadlock, I used the TryLock method and, if the series is already locked, I call mmapChunks outside of the striped lock. This should be OK since series.mmapChunks is still protected by the series' own lock. Another potential solution might be to create an additional lock on the stripeSeries and prevent the gc and mmapHeadChunks methods from running concurrently - this should still address the problem raised by this PR, as gc is way faster than the whole compact method.
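To make the lock-ordering issue and the best-effort TryLock idea concrete, here is a self-contained sketch with simplified stand-in types (not the actual stripeSeries/memSeries code, and not the exact diff in this PR):

```go
package sketch

import "sync"

// memSeries stands in for a head series; its mutex protects the open head
// chunk and the chunks that can still be mmapped.
type memSeries struct {
	sync.Mutex
	// ...
}

// stripe stands in for one shard of the striped series map. gc holds a
// stripe lock, then a series lock, and may then take another stripe lock
// (the hash stripe) to delete the series - the ordering that can deadlock
// with a scanner that blocks on series locks while holding a stripe lock.
type stripe struct {
	sync.Mutex
	series []*memSeries
}

// mmapHeadChunks walks one stripe and mmaps what it can. If a series is
// already locked (e.g. gc is working on it), it is skipped and retried
// outside the stripe lock instead of blocking; missing a series here is
// fine because the scan runs again a minute later.
func mmapHeadChunks(st *stripe, mmapChunks func(*memSeries)) {
	var deferred []*memSeries

	st.Lock()
	for _, s := range st.series {
		if s.TryLock() {
			mmapChunks(s)
			s.Unlock()
			continue
		}
		deferred = append(deferred, s)
	}
	st.Unlock()

	// With the stripe lock released there is no lock-ordering conflict
	// with gc anymore, so a plain Lock is safe for the skipped series.
	for _, s := range deferred {
		s.Lock()
		mmapChunks(s)
		s.Unlock()
	}
}
```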
With all that said, I wonder why we don't mmap the chunks as soon as a new head chunk is cut, which would eliminate the need for this background task that scans all series to find those with chunks to be mmapped - this seems to be a discussion for another PR, but I'm more than happy to try it out if we think it makes sense.
PS: I also rebased the PR.
I think the background mapping is to avoid a spike in IO, because it’s typical for every series to cut a new chunk at the same time. Look back in git history, you should find the reason.
> I think the background mapping is to avoid a spike in IO, because it’s typical for every series to cut a new chunk at the same time.
That makes sense. Doing it in the background might still cause the same spike, but it wouldn't block new appends, I guess?
Still, wouldn't it make more sense to keep track of the newly cut chunks instead of scanning all time series every minute?
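One hypothetical shape for that (purely illustrative, nothing like this exists in Prometheus today): have the appender record the series reference whenever it cuts a head chunk, and let the background loop drain only that queue instead of scanning every series.

```go
package sketch

import "sync"

// mmapQueue collects references of series that cut a new head chunk and
// therefore have something to mmap. Purely a sketch of the idea.
type mmapQueue struct {
	mu   sync.Mutex
	refs []uint64 // series refs with a newly cut, not-yet-mmapped chunk
}

// push would be called by the appender right after a head chunk is cut.
func (q *mmapQueue) push(ref uint64) {
	q.mu.Lock()
	q.refs = append(q.refs, ref)
	q.mu.Unlock()
}

// drain would be called by the background loop; it mmaps only the recorded
// series instead of walking all of them.
func (q *mmapQueue) drain(mmapSeries func(ref uint64)) {
	q.mu.Lock()
	refs := q.refs
	q.refs = nil
	q.mu.Unlock()

	for _, ref := range refs {
		mmapSeries(ref)
	}
}
```

The IO-spike concern from the previous comment would still need handling, e.g. by rate-limiting how fast drain works through the queue.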
I wonder if there is a fundamental design problem, as the same locks are used to lock different data structures (series and hashes) and, because of this, they can be acquired in different orders, causing the deadlock.
Certainly it's important not to have two bits of code that can take the locks in different orders. I'm not sure I would call having one such bit of code a design problem.
> Certainly it's important not to have two bits of code that can take the locks in different orders. I'm not sure I would call having one such bit of code a design problem.
Yeah... but for now I don't see any way to fix this without a potential performance penalty (like unnecessary locks) except by using the TryLock method. Using TryLock in the mmapHeadChunks function might be acceptable since it can be considered a best-effort approach. If some series aren't mmapped in the current call, it's not a big issue because this function is called every minute.
The PR right now even keeps track of the series it could not lock and mmaps their chunks outside of the striped lock, but I think this is unnecessary.
WDYT?
I created an alternative approach for this: https://github.com/prometheus/prometheus/pull/14752
Can we run prombench there just to see the comparison?
We discussed this with @alanprot and @yeya24 during KubeCon NA.
It's pretty hard to find this discussion (it's buried in this PR), but I think the gist is that, with native histograms (and especially with nhcb), the characteristics of WAL operations are different, so a different locking/head-compaction flow could be considered. Perhaps a separate discussion should be started for this, but first we have to reproduce the problem with prombench IMO. Otherwise it's not sustainable to fix it if we can't test against re-introducing the problem. Let's make a regression benchmark possible; I have some ideas for prombench improvements in mind.
I think you also created pure blocks for native histograms only, which might be unrealistic in practice too (is that right?).