Prometheus crashes with SIGSEGV when accessing mapped chunk during PromQL query
What did you do?
After Prometheus started up, I simulated a disk I/O hang on Prometheus's storage for about 60 seconds. Immediately after the storage became responsive again, I sent PromQL queries.
What did you expect to see?
Prometheus should validate that a chunk reference is valid before attempting to access it. If the reference is invalid, it should fail gracefully by returning an error or logging a warning — not crash with a SIGSEGV.
What did you see instead? Under which circumstances?
N/A
System information
No response
Prometheus version
v2.55.1
Prometheus configuration file
Alertmanager version
Alertmanager configuration file
Logs
unexpected fault address 0x7f833cf2dab8
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7f833cf2dab8 pc=0x55c4f63cc9fd]
goroutine 660243 gp=0xc03520a540 m=24 mp=0xc000f80008 [running]:
runtime.throw({0x55c4f679ffbd?, 0x0?})
runtime/panic.go:1101 +0x4a fp=0xc0136a32d0 sp=0xc0136a32a0 pc=0x55c4f3cc306a
runtime.sigpanic()
runtime/signal_unix.go:939 +0x26c fp=0xc0136a3330 sp=0xc0136a32d0 pc=0x55c4f3cc522c
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).Chunk(0xc000eb82d0, 0x20332daa0)
github.com/prometheus/prometheus/tsdb/chunks/head_chunks.go:753 +0x47d fp=0xc0136a34f0 sp=0xc0136a3330 pc=0x55c4f63cc9fd
github.com/prometheus/prometheus/tsdb.(*memSeries).chunk(0xc0076536b0, 0x10000000002?, 0x7f8341d8c588?, 0xc00066a2d8)
github.com/prometheus/prometheus/tsdb/head_read.go:462 +0x95 fp=0xc0136a3548 sp=0xc0136a34f0 pc=0x55c4f66a4955
github.com/prometheus/prometheus/tsdb.(*Head).chunkFromSeries(0xc00066a008, 0xc0076536b0, 0x3, 0x10?, 0x1967e683240, 0x1967e6f8540, 0xc000a042c0, 0x0)
github.com/prometheus/prometheus/tsdb/head_read.go:393 +0x9d fp=0xc0136a3630 sp=0xc0136a3548 pc=0x55c4f66a43fd
github.com/prometheus/prometheus/tsdb.(*headChunkReader).chunk(0xc014c171c0, {0x1619b000003, {0x0, 0x0}, 0x1967e4eec51, 0x1967e69ee61}, 0x0)
github.com/prometheus/prometheus/tsdb/head_read.go:379 +0x149 fp=0xc0136a36d8 sp=0xc0136a3630 pc=0x55c4f66a41e9
github.com/prometheus/prometheus/tsdb.(*headChunkReader).ChunkOrIterable(0x0?, {0x1619b000003, {0x0, 0x0}, 0x1967e4eec51, 0x1967e69ee61})
github.com/prometheus/prometheus/tsdb/head_read.go:350 +0x3e fp=0xc0136a3720 sp=0xc0136a36d8 pc=0x55c4f66a3f9e
github.com/prometheus/prometheus/tsdb.(*populateWithDelGenericSeriesIterator).next(0xc01e7ec7e0, 0x0)
github.com/prometheus/prometheus/tsdb/querier.go:655 +0x308 fp=0xc0136a3818 sp=0xc0136a3720 pc=0x55c4f66bc2a8
github.com/prometheus/prometheus/tsdb.(*populateWithDelSeriesIterator).Next(0xc01e7ec7e0)
github.com/prometheus/prometheus/tsdb/querier.go:735 +0x4f fp=0xc0136a3840 sp=0xc0136a3818 pc=0x55c4f66bca0f
github.com/prometheus/prometheus/storage.(*MemoizedSeriesIterator).Reset(...)
github.com/prometheus/prometheus/storage/memoized_iterator.go:62
github.com/prometheus/prometheus/promql.(*evaluator).evalSeries(0xc000c5e460, {0x55c4f8b79b70, 0xc00fb0d290}, {0xc008565408, 0x15c, 0x0?}, 0x0, 0x0)
github.com/prometheus/prometheus/promql/engine.go:1453 +0x285 fp=0xc0136a39c8 sp=0xc0136a3840 pc=0x55c4f6414b85
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc000c5e460, {0x55c4f8b79b70, 0xc00fb0d1d0}, {0x55c4f8b7dfc0, 0xc003ddcc60})
github.com/prometheus/prometheus/promql/engine.go:1949 +0x1725 fp=0xc0136a4358 sp=0xc0136a39c8 pc=0x55c4f6416e85
github.com/prometheus/prometheus/promql.(*evaluator).Eval(0xc000c5e460, {0x55c4f8b79b70?, 0xc00fb0d1d0?}, {0x55c4f8b7dfc0?, 0xc003ddcc60?})
github.com/prometheus/prometheus/promql/engine.go:1104 +0xaf fp=0xc0136a43f8 sp=0xc0136a4358 pc=0x55c4f6410e2f
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc000c96900, {0x55c4f8b79b70, 0xc00fb0cf90}, 0xc000c5e2a0, 0xc00f67ca50)
github.com/prometheus/prometheus/promql/engine.go:799 +0xbeb fp=0xc0136a4680 sp=0xc0136a43f8 pc=0x55c4f640e7ab
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc000c96900, {0x55c4f8b79b70, 0xc00fb0cf90}, 0xc000c5e2a0)
github.com/prometheus/prometheus/promql/engine.go:678 +0x41a fp=0xc0136a4880 sp=0xc0136a4680 pc=0x55c4f640cb9a
github.com/prometheus/prometheus/promql.(*query).Exec(0xc000c5e2a0, {0x55c4f8b79b70, 0xc00fb0cde0})
github.com/prometheus/prometheus/promql/engine.go:245 +0x1c5 fp=0xc0136a4970 sp=0xc0136a4880 pc=0x55c4f640a485
github.com/prometheus/prometheus/web/api/v1.(*API).queryRange(0xc00111c840, 0xc00999d900)
github.com/prometheus/prometheus/web/api/v1/api.go:562 +0x1146
Are you able to reproduce this with the latest v3 version? Also, sharing more about the build, the OS, and the FS would help.
Hi @machine424, no, I am not able to reproduce the issue with Prometheus v3 yet; my system cannot be upgraded from v2 to v3 due to concerns about NBC risks.
Here is the information requested:
Build: prometheus, version 2.55.1 (branch: HEAD, revision: f5b31e57423e8d0b1b868b4412a3aa19cfdfb0c1)
build date: 20250516-18:07:34
go version: go1.24.2
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
OS: SUSE Linux Enterprise Server 15 SP6
FS:
Filesystem  Type  1K-blocks  Used  Available  Use%  Mounted on
/dev/rbd7   ext4  8154588    760   8137444    1%    /data
This is the full backtrace: backtrace.txt
Hi, we would like to suggest the following: Prometheus should validate that a chunk reference is valid before attempting to access it. If the reference is invalid, it should fail gracefully by returning an error or logging a warning — not crash with a SIGSEGV.
Hi @machine424, What do you think?
Thanks for the details. Unfortunately, without a reproducer we can’t do much. If you or someone else can work on this, we could emulate a slow disk using something like https://enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html?utm_source=chatgpt.com. I have https://github.com/prometheus/prometheus/pull/16339, which emulates a size-restricted FS for another reproducer; it can help with how to set up such an environment.
Hi @machine424,
Thanks for your suggestion; let me try to reproduce the issue with the latest Prometheus version.
A quick check of the code tells me that we haven't really improved the checks in the relevant code area between v2.55 and the current release, so I would expect the same error to happen. The question is whether it is something that we can actually check for. There are a bunch of checks in place in tsdb/chunks/head_chunks.go - you can look at the lines before L753 that create a CorruptionErr. So we need to understand better what's going wrong here, to then see whether we can check for that before accessing the pages in question.
Hi @machine424,
We have just tried to reproduce the issue with the latest Prometheus version (v3.4.1). The backtrace is here: new_backtrace.txt
Hi @beorn7, I also looked into Prometheus's behavior a bit; I believe this is what Prometheus does:
1. Prometheus scrapes metrics → stores them in memSeries (RAM).
2. When a chunk is full (or after ~2 hours) → the chunk is "cut".
3. The chunk is written to the WAL and a chunk file (chunks_head/).
4. The memChunk in RAM is replaced with a mappedChunk (metadata + ref).
If I/O hangs while step (3) is running:
- The chunk may be written to chunks_head/ but not yet fully flushed (fsync incomplete).
- The ref points to a theoretically valid offset, but the actual data isn't available yet.
During a query → Prometheus uses chunkDiskMapper.Chunk(ref) to read the chunk from disk. If Prometheus queries the chunk immediately, it reads from a not-yet-available memory region → SIGSEGV.
So I think Prometheus should perform one more check before accessing the chunk reference (a rough sketch of the idea follows).
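For illustration only, here is a minimal sketch of the kind of pre-access check I mean, assuming a hypothetical mappedFile type with an illustrative maxWrittenOffset field; this is not the actual Prometheus head_chunks.go API, just the shape of the idea:

```go
// Hypothetical sketch; type and field names are illustrative, not the
// actual Prometheus implementation.
package chunkcheck

import "fmt"

// mappedFile stands in for an mmap'd head-chunk file.
type mappedFile struct {
	byteSlice        []byte // the mmap'd region
	maxWrittenOffset int    // bytes known to be fully written and flushed
}

// checkChunkRange rejects chunk offsets that point past the data known to
// be durably written, before any byte of the mapping is touched.
func (m *mappedFile) checkChunkRange(start, length int) error {
	end := start + length
	if start < 0 || length < 0 || end > len(m.byteSlice) {
		return fmt.Errorf("chunk range [%d, %d) outside mapping of %d bytes", start, end, len(m.byteSlice))
	}
	if end > m.maxWrittenOffset {
		return fmt.Errorf("chunk range [%d, %d) beyond flushed data (%d bytes)", start, end, m.maxWrittenOffset)
	}
	return nil
}
```

A check like this can only reject offsets beyond what the process believes it has written; it cannot detect pages that the kernel itself can no longer serve.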
Hi @machine424 and @beorn7, do you have any comments?
Thanks for the details. Having a test that reproduces the issue with what I shared in https://github.com/prometheus/prometheus/issues/16621#issuecomment-2927060022, or at least the steps to reproduce it manually, would be really helpful.
I’m not questioning your diagnosis, but if you need my input, I’ll first need a way to reproduce the issue. I don’t currently have the bandwidth to chase the bug down myself, so any additional information to help reproduce it would help speed things up and make things easier.
Hi @machine424 and @beorn7,
I have looked deeper into the Prometheus source code, and I see the issue happens at github.com/prometheus/prometheus/tsdb/chunks/head_chunks.go:753. That line is: chkEnc := mmapFile.byteSlice.Range(chkStart, chkStart+ChunkEncodingSize)[0]
The root cause may be that the byte slice in mmapFile.byteSlice was pointing to a memory-mapped file region that had been revoked or invalidated by the kernel due to the I/O stall (the ~60-second I/O hang). Although the slice length appeared valid, when Go tried to access slice[start:end], the kernel could not fulfill the page request → causing a SIGSEGV at the Go runtime level.
As @beorn7 mentioned before, yes, Prometheus already performs checks on offset bounds, CRC validation, file existence, etc. However, Go has no way to check whether a memory-mapped region is still valid at the kernel level. So even with correct logic, accessing an invalidated memory region with .Range() results in an unavoidable SIGSEGV.
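To illustrate that a Go-level check cannot see this, here is a small standalone program (not Prometheus code) that maps a file, invalidates the backing pages by truncating the file, and then touches the mapping. On Linux this particular case surfaces as SIGBUS rather than SIGSEGV, but the Go runtime reports both as an unexpected fault and crashes, much like the trace above:

```go
// Standalone reproducer of a fault on mmap'd memory; not Prometheus code.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.CreateTemp("", "mmap-fault-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	const size = 4096
	if err := f.Truncate(size); err != nil {
		panic(err)
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Println("first read ok:", data[0]) // pages are backed, this works

	// Invalidate the backing pages: the mapping is still `size` bytes long,
	// but the file no longer has data behind it.
	if err := f.Truncate(0); err != nil {
		panic(err)
	}

	// The slice header still looks valid (len == 4096), but touching the
	// page now faults: the runtime prints "unexpected fault address" and
	// "fatal error: fault", and the process dies.
	fmt.Println("second read:", data[0])
}
```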
Interesting. Thanks for the research.
Questions that come to mind:
- How can we gain more confidence that this is really the problem here?
- If so, is there really nothing we can do from within Go to catch the problem before segfaulting? (If we cannot do anything, this would be another reason to doubt mmap'ing in Go in general. There is also the more or less well-known reason that mmap'ing interferes really badly with the Go scheduler.)
- How commonly does this happen? And under which circumstances? (If this happens only on systems in dire straits, we could maybe say that we also have a lot of other problems at that moment, and crashing Prometheus is not the worst one.)
Hi @beorn7,
How can we gain more confidence that this is really the problem here?
- I see this is really an issue with Prometheus because it can be reproduced.

How commonly does this happen? And under which circumstances?
- Sometimes I/O is unstable (it hangs) for a short period of time, and that causes Prometheus to crash. This should not happen; Prometheus should have a mechanism to prevent the crash or to recover after it.
I packaged Prometheus in a Helm chart to deploy on a Kubernetes cluster. Here are the steps to reproduce the issue:
- deploy Prometheus and scrape metrics from other applications with a 15s scrape interval (a smaller scrape interval and a large number of time series make the issue happen more often)
- create a script with a while-true loop to send queries to Prometheus continuously (a minimal example of such a query loop is sketched below)
- run Chaos Mesh (action: latency) to make Prometheus's storage hang for 60 seconds
- after 60s, the storage is back; check and ensure Prometheus still works as normal
- run Chaos Mesh a second time
- after 60s, the storage is back, but Prometheus has crashed at this point

More detail about the Chaos Mesh tool: https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/
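For completeness, here is a minimal sketch of the query loop in Go, assuming Prometheus is reachable at http://localhost:9090; the address, query expression, and interval are placeholders to adapt to your environment:

```go
// Continuously send range queries to Prometheus so that queries are in
// flight when the I/O hang ends; address and query are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	const base = "http://localhost:9090/api/v1/query_range"
	for {
		now := time.Now()
		params := url.Values{}
		params.Set("query", "up") // any expression that touches head chunks
		params.Set("start", fmt.Sprint(now.Add(-1*time.Hour).Unix()))
		params.Set("end", fmt.Sprint(now.Unix()))
		params.Set("step", "15")

		resp, err := http.Get(base + "?" + params.Encode())
		if err != nil {
			fmt.Println("query failed:", err) // expected while I/O is hanging
			time.Sleep(time.Second)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain the body so connections are reused
		resp.Body.Close()
		fmt.Println("status:", resp.Status)
		time.Sleep(time.Second)
	}
}
```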
Chaos tools have the inherent problem that they might happily create conditions that are exceedingly unlikely to happen in practice, or they might direct our focus towards situations that are irrelevant in practice (e.g. what I sketched out above: if you only get into that situation if your system is toast anyway, there is not much gain in handling it).
SIGSEGV crashes of Prometheus seem to be rare from what I hear. And we don't know how many of the SIGSEGV crashes we see in the wild are actually triggered by I/O latency. It is an important insight that you can trigger the SIGSEGV crash by artificially introducing I/O latency, but it only proves that it is possible, not that every SIGSEGV crash is necessarily triggered by it.
If there were an easy way of handling this condition, we should of course do it, but it seems to be quite problematic. See next comment.
I tried to do a bit of research, but I'm not an expert in these things…
My understanding so far is that the Go runtime crashes the program by default if it receives a SIGSEGV. However, there is debug.SetPanicOnFault, which we could enable for a goroutine accessing mmap'd memory. The Go runtime then triggers a panic rather than a crash, and we could handle the panic.
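As a rough illustration of that idea only (a sketch, not how Prometheus currently does it; safeRead is a hypothetical helper wrapped around accesses to the mmap'd bytes):

```go
// Sketch of turning a fault on mmap'd memory into an error via
// debug.SetPanicOnFault; safeRead is hypothetical, not Prometheus code.
package chunkread

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// safeRead copies a range out of an mmap'd byte slice. If touching the
// mapping faults (e.g. the kernel cannot page the data in), the runtime
// panics instead of crashing the process, and we recover and return an error.
func safeRead(mmapped []byte, start, end int) (b []byte, err error) {
	old := debug.SetPanicOnFault(true) // applies to the current goroutine only
	defer debug.SetPanicOnFault(old)
	defer func() {
		if r := recover(); r != nil {
			if re, ok := r.(runtime.Error); ok {
				err = fmt.Errorf("reading mmapped chunk: %v", re)
				return
			}
			panic(r) // not a fault-related panic; re-raise it
		}
	}()

	if start < 0 || end > len(mmapped) || start > end {
		return nil, fmt.Errorf("chunk range [%d, %d) out of bounds", start, end)
	}
	b = make([]byte, end-start)
	copy(b, mmapped[start:end]) // the copy is where a fault would surface
	return b, nil
}
```

Whether recovering like this is actually safe, and what it costs, is exactly what the following concerns are about.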
Concerns that come to mind (again with the disclaimer that I'm not an expert in these things):
- Could we end up in a situation where we handle a panic that we should not have handled because it leaves the program in some undefined state or even just in bad shape where crashing and restarting would actually be the better remediation?
- Even if the panic happens in the case as described here, is it safe to just stop the current query and return a query error to the user (which, I guess, would be the way to handle the panic)?
- Are there other impacts of frequently calling debug.SetPanicOnFault, like a performance impact? It's called debug for a reason, I guess, maybe not to be used in production…
All of this has to be weighed against the impact of the problem. See above, if this is happening very rarely, we might cause more trouble than benefits by trying to handle it.
Hi @beorn7,
Thank you very much for your idea. Could I ask you a question? I see that if I/O hangs 2 times in a row, the issue happens, but if I/O hangs occur only every 2 hours, the problem does not happen again. Might the reference be renewed, or something like that, after WAL replays?
Could I ask you a question? I see that if I/O hangs 2 times in a row, the issue happens, but if I/O hangs occur only every 2 hours, the problem does not happen again. Might the reference be renewed, or something like that, after WAL replays?
That's a good question, to which I have no answer. :)
As said, not really an expert in this area. Maybe some actual expert is reading this and will enlighten us.
Hi @beorn7
Thank you very much for your help; your earlier comments have given me more knowledge about Prometheus. However, could you please let me know who the expert you mentioned is?
Could you please let me know who the expert you mentioned is?
I don't know. I'm just hoping an expert is reading this and can chime in.
if I/O hangs occur only every 2 hours, the problem does not happen again. Might the reference be renewed, or something like that, after WAL replays?
When a new chunk is written every 2 hours, we create a new mmap ref, and that might explain why it doesn't break? I haven't reproduced the bug yet, but is there consensus that a better way to handle this is to capture the SIGSEGV and either munmap + mmap the file again, or gracefully shut down Prometheus?
Once we reproduce the bug, the ideal fix would be to avoid the SIGSEGV in the first place.
Hi, I have provided the steps to reproduce the issue above. Currently, I have no way to work around the issue.
Hello from the bug scrub!
This seems to be low priority from the maintainer side because of the low frequency of reports outside of artificially created scenarios. Also, there is an ongoing debate about whether to keep mmap at all.
Still, if somebody with a good understanding of what's going on here is able to provide a fix, we'd appreciate it.