Prometheus crashes with SIGSEGV when accessing mapped chunk during PromQL query
What did you do?
After Prometheus started up, I simulated a disk I/O hang on Prometheus's storage for about 60 seconds. Immediately after the storage became responsive again, I sent PromQL queries.
What did you expect to see?
Prometheus should validate that a chunk reference is valid before attempting to access it. If the reference is invalid, it should fail gracefully by returning an error or logging a warning — not crash with a SIGSEGV.
What did you see instead? Under which circumstances?
N/A
System information
No response
Prometheus version
v2.55.1
Prometheus configuration file
Alertmanager version
Alertmanager configuration file
Logs
unexpected fault address 0x7f833cf2dab8
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7f833cf2dab8 pc=0x55c4f63cc9fd]
goroutine 660243 gp=0xc03520a540 m=24 mp=0xc000f80008 [running]:
runtime.throw({0x55c4f679ffbd?, 0x0?})
runtime/panic.go:1101 +0x4a fp=0xc0136a32d0 sp=0xc0136a32a0 pc=0x55c4f3cc306a
runtime.sigpanic()
runtime/signal_unix.go:939 +0x26c fp=0xc0136a3330 sp=0xc0136a32d0 pc=0x55c4f3cc522c
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).Chunk(0xc000eb82d0, 0x20332daa0)
github.com/prometheus/prometheus/tsdb/chunks/head_chunks.go:753 +0x47d fp=0xc0136a34f0 sp=0xc0136a3330 pc=0x55c4f63cc9fd
github.com/prometheus/prometheus/tsdb.(*memSeries).chunk(0xc0076536b0, 0x10000000002?, 0x7f8341d8c588?, 0xc00066a2d8)
github.com/prometheus/prometheus/tsdb/head_read.go:462 +0x95 fp=0xc0136a3548 sp=0xc0136a34f0 pc=0x55c4f66a4955
github.com/prometheus/prometheus/tsdb.(*Head).chunkFromSeries(0xc00066a008, 0xc0076536b0, 0x3, 0x10?, 0x1967e683240, 0x1967e6f8540, 0xc000a042c0, 0x0)
github.com/prometheus/prometheus/tsdb/head_read.go:393 +0x9d fp=0xc0136a3630 sp=0xc0136a3548 pc=0x55c4f66a43fd
github.com/prometheus/prometheus/tsdb.(*headChunkReader).chunk(0xc014c171c0, {0x1619b000003, {0x0, 0x0}, 0x1967e4eec51, 0x1967e69ee61}, 0x0)
github.com/prometheus/prometheus/tsdb/head_read.go:379 +0x149 fp=0xc0136a36d8 sp=0xc0136a3630 pc=0x55c4f66a41e9
github.com/prometheus/prometheus/tsdb.(*headChunkReader).ChunkOrIterable(0x0?, {0x1619b000003, {0x0, 0x0}, 0x1967e4eec51, 0x1967e69ee61})
github.com/prometheus/prometheus/tsdb/head_read.go:350 +0x3e fp=0xc0136a3720 sp=0xc0136a36d8 pc=0x55c4f66a3f9e
github.com/prometheus/prometheus/tsdb.(*populateWithDelGenericSeriesIterator).next(0xc01e7ec7e0, 0x0)
github.com/prometheus/prometheus/tsdb/querier.go:655 +0x308 fp=0xc0136a3818 sp=0xc0136a3720 pc=0x55c4f66bc2a8
github.com/prometheus/prometheus/tsdb.(*populateWithDelSeriesIterator).Next(0xc01e7ec7e0)
github.com/prometheus/prometheus/tsdb/querier.go:735 +0x4f fp=0xc0136a3840 sp=0xc0136a3818 pc=0x55c4f66bca0f
github.com/prometheus/prometheus/storage.(*MemoizedSeriesIterator).Reset(...)
github.com/prometheus/prometheus/storage/memoized_iterator.go:62
github.com/prometheus/prometheus/promql.(*evaluator).evalSeries(0xc000c5e460, {0x55c4f8b79b70, 0xc00fb0d290}, {0xc008565408, 0x15c, 0x0?}, 0x0, 0x0)
github.com/prometheus/prometheus/promql/engine.go:1453 +0x285 fp=0xc0136a39c8 sp=0xc0136a3840 pc=0x55c4f6414b85
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc000c5e460, {0x55c4f8b79b70, 0xc00fb0d1d0}, {0x55c4f8b7dfc0, 0xc003ddcc60})
github.com/prometheus/prometheus/promql/engine.go:1949 +0x1725 fp=0xc0136a4358 sp=0xc0136a39c8 pc=0x55c4f6416e85
github.com/prometheus/prometheus/promql.(*evaluator).Eval(0xc000c5e460, {0x55c4f8b79b70?, 0xc00fb0d1d0?}, {0x55c4f8b7dfc0?, 0xc003ddcc60?})
github.com/prometheus/prometheus/promql/engine.go:1104 +0xaf fp=0xc0136a43f8 sp=0xc0136a4358 pc=0x55c4f6410e2f
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc000c96900, {0x55c4f8b79b70, 0xc00fb0cf90}, 0xc000c5e2a0, 0xc00f67ca50)
github.com/prometheus/prometheus/promql/engine.go:799 +0xbeb fp=0xc0136a4680 sp=0xc0136a43f8 pc=0x55c4f640e7ab
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc000c96900, {0x55c4f8b79b70, 0xc00fb0cf90}, 0xc000c5e2a0)
github.com/prometheus/prometheus/promql/engine.go:678 +0x41a fp=0xc0136a4880 sp=0xc0136a4680 pc=0x55c4f640cb9a
github.com/prometheus/prometheus/promql.(*query).Exec(0xc000c5e2a0, {0x55c4f8b79b70, 0xc00fb0cde0})
github.com/prometheus/prometheus/promql/engine.go:245 +0x1c5 fp=0xc0136a4970 sp=0xc0136a4880 pc=0x55c4f640a485
github.com/prometheus/prometheus/web/api/v1.(*API).queryRange(0xc00111c840, 0xc00999d900)
github.com/prometheus/prometheus/web/api/v1/api.go:562 +0x1146
Are you able to reproduce this with the latest v3 version? Also, sharing more about the build, the OS, and the FS would help.
Hi @machine424, no, I am not able to reproduce the issue with Prometheus v3 yet; my system cannot be upgraded from v2 to v3 due to concerns about NBC risks.
Here is the information requested:
Build: prometheus, version 2.55.1 (branch: HEAD, revision: f5b31e57423e8d0b1b868b4412a3aa19cfdfb0c1)
build date: 20250516-18:07:34
go version: go1.24.2
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
OS: SUSE Linux Enterprise Server 15 SP6
FS:
Filesystem  Type  1K-blocks  Used  Available  Use%  Mounted on
/dev/rbd7   ext4  8154588    760   8137444    1%    /data
This is the full backtrace: backtrace.txt
Hi, we would like to suggest the following: Prometheus should validate that a chunk reference is valid before attempting to access it. If the reference is invalid, it should fail gracefully by returning an error or logging a warning — not crash with a SIGSEGV.
Hi @machine424, What do you think?
Thanks for the details. Unfortunately, without a reproducer we can’t do much. If you or someone else can work on this, we could emulate a slow disk using something like https://enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html?utm_source=chatgpt.com. I have https://github.com/prometheus/prometheus/pull/16339, which emulates a size-restricted FS for another reproducer; it can help with how to set up such an environment.
Hi @machine424,
Thanks for your suggestion; let me try to reproduce the issue with the latest Prometheus version.
A quick check of the code tells me that we haven't really improved the checks in the relevant code area between v2.55 and the current release, so I would expect the same error to happen. The question is whether it is something that we can actually check for. There are a bunch of checks in place in tsdb/chunks/head_chunks.go - you can look at the lines before L753 that create a CorruptionErr. So we need to understand better what's going wrong here, to then see whether we can check for that before accessing the pages in question.
Hi @machine424,
We have just tried to reproduce the issue with the latest Prometheus version (v3.4.1). The backtrace is here: new_backtrace.txt
Hi @beorn7, I also looked into Prometheus's behavior a bit; I believe this is what Prometheus does:
1. Prometheus scrapes metrics → stores them in memSeries (RAM).
2. When a chunk is full (or after ~2 hours) → the chunk is "cut".
3. The chunk is written to the WAL and a chunk file (chunks_head/).
4. The memChunk in RAM is replaced with a mappedChunk (metadata + ref).
If I/O hangs while step (3) is running:
- The chunk may be written to chunks_head/ but not yet fully flushed (fsync incomplete).
- The ref points to a theoretically valid offset, but the actual data isn't available yet.
During a query → Prometheus uses chunkDiskMapper.Chunk(ref) to read the chunk from disk. If Prometheus queries the chunk immediately, it reads from a not-yet-available memory region → SIGSEGV.
So I think Prometheus should perform one more check before accessing the chunk reference (a rough sketch of the idea follows).
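For illustration only, here is a minimal sketch of the kind of pre-access check I mean, assuming a hypothetical mappedFile type with an illustrative maxWrittenOffset field; this is not the actual Prometheus head_chunks.go API, just the shape of the idea:

```go
// Hypothetical sketch; type and field names are illustrative, not the
// actual Prometheus implementation.
package chunkcheck

import "fmt"

// mappedFile stands in for an mmap'd head-chunk file.
type mappedFile struct {
	byteSlice        []byte // the mmap'd region
	maxWrittenOffset int    // bytes known to be fully written and flushed
}

// checkChunkRange rejects chunk offsets that point past the data known to
// be durably written, before any byte of the mapping is touched.
func (m *mappedFile) checkChunkRange(start, length int) error {
	end := start + length
	if start < 0 || length < 0 || end > len(m.byteSlice) {
		return fmt.Errorf("chunk range [%d, %d) outside mapping of %d bytes", start, end, len(m.byteSlice))
	}
	if end > m.maxWrittenOffset {
		return fmt.Errorf("chunk range [%d, %d) beyond flushed data (%d bytes)", start, end, m.maxWrittenOffset)
	}
	return nil
}
```

A check like this can only reject offsets beyond what the process believes it has written; it cannot detect pages that the kernel itself can no longer serve.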
Hi @machine424 and @beorn7, do you have any comments?
Thanks for the details. Having a test that reproduces the issue with what I shared in https://github.com/prometheus/prometheus/issues/16621#issuecomment-2927060022, or at least the steps to reproduce it manually, would be really helpful.
I’m not questioning your diagnosis, but if you need my input, I’ll first need a way to reproduce the issue. I don’t currently have the bandwidth to chase the bug down myself, so any additional information to help reproduce it would help speed things up and make things easier.
Hi @machine424 and @beorn7,
I have looked deeper into the Prometheus source code, and I see the issue happens at github.com/prometheus/prometheus/tsdb/chunks/head_chunks.go:753. That line is: chkEnc := mmapFile.byteSlice.Range(chkStart, chkStart+ChunkEncodingSize)[0]
The root cause may be that the byte slice in mmapFile.byteSlice was pointing to a memory-mapped file region that had been revoked or invalidated by the kernel due to the I/O stall (the ~60-second I/O hang). Although the slice length appeared valid, when Go tried to access slice[start:end], the kernel could not fulfill the page request → causing a SIGSEGV at the Go runtime level.
As @beorn7 mentioned before, yes, Prometheus already performs checks on offset bounds, CRC validation, file existence, etc. However, Go has no way to check whether a memory-mapped region is still valid at the kernel level. So even with correct logic, accessing an invalidated memory region with .Range() results in an unavoidable SIGSEGV.
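To illustrate that a Go-level check cannot see this, here is a small standalone program (not Prometheus code) that maps a file, invalidates the backing pages by truncating the file, and then touches the mapping. On Linux this particular case surfaces as SIGBUS rather than SIGSEGV, but the Go runtime reports both as an unexpected fault and crashes, much like the trace above:

```go
// Standalone reproducer of a fault on mmap'd memory; not Prometheus code.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.CreateTemp("", "mmap-fault-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	const size = 4096
	if err := f.Truncate(size); err != nil {
		panic(err)
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Println("first read ok:", data[0]) // pages are backed, this works

	// Invalidate the backing pages: the mapping is still `size` bytes long,
	// but the file no longer has data behind it.
	if err := f.Truncate(0); err != nil {
		panic(err)
	}

	// The slice header still looks valid (len == 4096), but touching the
	// page now faults: the runtime prints "unexpected fault address" and
	// "fatal error: fault", and the process dies.
	fmt.Println("second read:", data[0])
}
```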
Interesting. Thanks for the research.
Questions that come to mind:
- How can we gain more confidence that this is really the problem here?
- If so, is there really nothing we can do from within Go to catch the problem before segfaulting? (If we cannot do anything, this would be another reason to doubt mmap'ing in Go in general. There is also the more or less well-known reason that mmap'ing interferes really badly with the Go scheduler.)
- How commonly does this happen? And under which circumstances? (If this happens only on systems in dire straits, we could maybe say that we also have a lot of other problems at that moment, and crashing Prometheus is not the worst one.)
Hi @beorn7,
How can we gain more confidence that this is really the problem here?
- I see this is really an issue with Prometheus because it can be reproduced.

How commonly does this happen? And under which circumstances?
- Sometimes I/O is unstable (it hangs) for a short period of time, and that causes Prometheus to crash. This should not happen; Prometheus should have a mechanism to prevent the crash or to recover after it.
I packaged Prometheus in a Helm chart to deploy on a Kubernetes cluster. Here are the steps to reproduce the issue:
- deploy Prometheus and scrape metrics from other applications with a 15s scrape interval (a smaller scrape interval and a large number of time series make the issue happen more often)
- create a script with a while-true loop to send queries to Prometheus continuously (a minimal example of such a query loop is sketched below)
- run Chaos Mesh (action: latency) to make Prometheus's storage hang for 60 seconds
- after 60s, the storage is back; check and ensure Prometheus still works as normal
- run Chaos Mesh a second time
- after 60s, the storage is back, but Prometheus has crashed at this point

More detail about the Chaos Mesh tool: https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/
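For completeness, here is a minimal sketch of the query loop in Go, assuming Prometheus is reachable at http://localhost:9090; the address, query expression, and interval are placeholders to adapt to your environment:

```go
// Continuously send range queries to Prometheus so that queries are in
// flight when the I/O hang ends; address and query are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	const base = "http://localhost:9090/api/v1/query_range"
	for {
		now := time.Now()
		params := url.Values{}
		params.Set("query", "up") // any expression that touches head chunks
		params.Set("start", fmt.Sprint(now.Add(-1*time.Hour).Unix()))
		params.Set("end", fmt.Sprint(now.Unix()))
		params.Set("step", "15")

		resp, err := http.Get(base + "?" + params.Encode())
		if err != nil {
			fmt.Println("query failed:", err) // expected while I/O is hanging
			time.Sleep(time.Second)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain the body so connections are reused
		resp.Body.Close()
		fmt.Println("status:", resp.Status)
		time.Sleep(time.Second)
	}
}
```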
Chaos tools have the inherent problem that they might happily create conditions that are exceedingly unlikely to happen in practice, or they might direct our focus towards situations that are irrelevant in practice (e.g. what I sketched out above: if you only get into that situation if your system is toast anyway, there is not much gain in handling it).
SIGSEGV crashes of Prometheus seem to be rare from what I hear. And we don't know how many of the SIGSEGV crashes we see in the wild are actually triggered by I/O latency. It is an important insight that you can trigger the SIGSEGV crash by artificially introducing I/O latency, but it only proves that it is possible, not that every SIGSEGV crash is necessarily triggered by it.
If there were an easy way of handling this condition, we should of course do it, but it seems to be quite problematic. See next comment.
I tried to do a bit of research, but I'm not an expert in these things…
My understanding so far is that the Go runtime crashes the program by default if it receives a SIGSEGV. However, there is debug.SetPanicOnFault, which we could enable for a goroutine accessing mmap'd memory. The Go runtime then triggers a panic rather than a crash, and we could handle the panic.
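As a rough illustration of that idea only (a sketch, not how Prometheus currently does it; safeRead is a hypothetical helper wrapped around accesses to the mmap'd bytes):

```go
// Sketch of turning a fault on mmap'd memory into an error via
// debug.SetPanicOnFault; safeRead is hypothetical, not Prometheus code.
package chunkread

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// safeRead copies a range out of an mmap'd byte slice. If touching the
// mapping faults (e.g. the kernel cannot page the data in), the runtime
// panics instead of crashing the process, and we recover and return an error.
func safeRead(mmapped []byte, start, end int) (b []byte, err error) {
	old := debug.SetPanicOnFault(true) // applies to the current goroutine only
	defer debug.SetPanicOnFault(old)
	defer func() {
		if r := recover(); r != nil {
			if re, ok := r.(runtime.Error); ok {
				err = fmt.Errorf("reading mmapped chunk: %v", re)
				return
			}
			panic(r) // not a fault-related panic; re-raise it
		}
	}()

	if start < 0 || end > len(mmapped) || start > end {
		return nil, fmt.Errorf("chunk range [%d, %d) out of bounds", start, end)
	}
	b = make([]byte, end-start)
	copy(b, mmapped[start:end]) // the copy is where a fault would surface
	return b, nil
}
```

Whether recovering like this is actually safe, and what it costs, is exactly what the following concerns are about.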
Concerns that come to mind (again with the disclaimer that I'm not an expert in these things):
- Could we end up in a situation where we handle a panic that we should not have handled because it leaves the program in some undefined state or even just in bad shape where crashing and restarting would actually be the better remediation?
- Even if the panic happens in the case as described here, is it safe to just stop the current query and return a query error to the user (which, I guess, would be the way to handle the panic)?
- Are there other impacts of frequently calling debug.SetPanicOnFault, like a performance impact? It's called debug for a reason, I guess, maybe not to be used in production…
All of this has to be weighed against the impact of the problem. See above, if this is happening very rarely, we might cause more trouble than benefits by trying to handle it.
Hi @beorn7,
Thank you very much for your idea. Could I ask you a question? I see that if I/O hangs 2 times in a row, the issue happens, but if I/O hangs occur only every 2 hours, the problem does not happen again. Might the reference be renewed, or something like that, after WAL replays?
Could I ask you a question? I see that if I/O hangs 2 times in a row, the issue happens, but if I/O hangs occur only every 2 hours, the problem does not happen again. Might the reference be renewed, or something like that, after WAL replays?
That's a good question, to which I have no answer. :)
As said, not really an expert in this area. Maybe some actual expert is reading this and will enlighten us.
Hi @beorn7
Thank you very much for your help; your earlier comments have given me more knowledge about Prometheus. However, could you please let me know who the expert you mentioned is?
Could you please let me know who the expert you mentioned is?
I don't know. I'm just hoping an expert is reading this and can chime in.
if I/O hangs occur only every 2 hours, the problem does not happen again. Might the reference be renewed, or something like that, after WAL replays?
When a new chunk is written every 2 hours, we create a new mmap ref, and that might explain why it doesn't break? I haven't reproduced the bug yet, but is there consensus that a better way to handle this is to capture the SIGSEGV and either munmap + mmap the file again, or gracefully shut down Prometheus?
Once we reproduce the bug, the ideal fix would be to avoid the SIGSEGV in the first place.
Hi, I have provided the steps to reproduce the issue above. Currently, I have no way to work around the issue.
Hello from the bug scrub!
This seems to be low priority from the maintainer side because of the low frequency of reports outside of artificially created scenarios. Also, there is an ongoing debate about whether to keep mmap at all.
Still, if somebody with a good understanding of what's going on here is able to provide a fix, we'd appreciate it.