
Idea: add a mode where Prometheus will do a head compaction before starting scraping

Open bboreham opened this issue 1 year ago • 1 comments

Proposal

As @mattburgess said at https://github.com/prometheus/prometheus/issues/6934#issuecomment-945739611

would it be possible to run a compaction/WAL cleanup as soon as the replay has been completed? That would help us avoid the crash-loop we find ourselves in, whereby a pod restart causes WAL replay to happen, but then they never get cleaned up, so when the OOM-killer comes along the same WALs get replayed on the next restart and on and on we go in circles until we resign ourselves to removing the WALs and thereby losing data.

I found myself wanting this today.

I will add: I think it is important that we not start scraping until this head compaction has finished. That keeps memory usage down and avoids making the WAL any bigger; if Prometheus OOMs again, it will restart with a worse problem.

It could be a CLI flag to Prometheus, like --force-head-compaction-at-start.
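To make the intended ordering concrete, here is a minimal sketch in Go. It is not Prometheus code: the flag does not exist today, the sketch uses the standard library `flag` package instead of Prometheus's own flag handling, and `parseForceCompaction` and `startupSteps` are hypothetical names invented for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseForceCompaction reads the proposed (hypothetical) flag from args.
func parseForceCompaction(args []string) (bool, error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	force := fs.Bool("force-head-compaction-at-start", false,
		"compact the head after WAL replay, before scraping starts (proposal only)")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *force, nil
}

// startupSteps returns the startup phases in the order they would run.
// With the flag set, compaction and WAL truncation happen before scraping,
// so no new samples arrive while memory is still high.
func startupSteps(forceHeadCompaction bool) []string {
	steps := []string{"replay WAL"}
	if forceHeadCompaction {
		steps = append(steps, "compact head", "truncate WAL")
	}
	return append(steps, "start scraping")
}

func main() {
	force, err := parseForceCompaction(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	for i, s := range startupSteps(force) {
		fmt.Printf("%d. %s\n", i+1, s)
	}
}
```

The point of the sketch is only the ordering: scraping is always the last step, and the extra work is inserted between replay and scraping rather than running concurrently with it.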

bboreham avatar Sep 14 '22 15:09 bboreham

Related ideas: #7575, #7939.

bboreham avatar Sep 14 '22 16:09 bboreham

Sounds good to me, but what would the effect be if someone used this routinely? Would it still create non-overlapping, aligned blocks, or would it create new blocks every time?

roidelapluie avatar Jan 23 '23 13:01 roidelapluie

During research for #12286 I realised that doing a WAL checkpoint and truncation up to the end of the last real block, before doing anything else, would be beneficial. We could perhaps also scan the WAL to see which series have samples after that time, so that the checkpoint can be built with the minimum set of series.
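A rough sketch of the "minimum set of series" idea, in Go. This is an illustration only and does not use the real tsdb WAL types: `walSample`, `seriesToKeep` and `cutoff` are hypothetical names, and `cutoff` is assumed to be the first timestamp not covered by a persisted block.

```go
package main

import (
	"fmt"
	"sort"
)

// walSample is a simplified stand-in for a sample record read from the WAL.
type walSample struct {
	ref uint64 // series reference
	t   int64  // timestamp in milliseconds
}

// seriesToKeep returns the sorted refs of series that have at least one
// sample at or after cutoff. Only these would need to go in the checkpoint.
func seriesToKeep(samples []walSample, cutoff int64) []uint64 {
	seen := make(map[uint64]struct{})
	for _, s := range samples {
		if s.t >= cutoff {
			seen[s.ref] = struct{}{}
		}
	}
	refs := make([]uint64, 0, len(seen))
	for r := range seen {
		refs = append(refs, r)
	}
	sort.Slice(refs, func(i, j int) bool { return refs[i] < refs[j] })
	return refs
}

func main() {
	samples := []walSample{
		{ref: 1, t: 1000}, // already covered by a block
		{ref: 2, t: 1500}, // already covered by a block
		{ref: 2, t: 2500}, // newer than the last block
		{ref: 3, t: 3000}, // newer than the last block
	}
	fmt.Println(seriesToKeep(samples, 2000))
}
```

Series 1 has no samples at or after the cutoff, so it would be left out of the checkpoint; a real implementation would also need to handle series records, tombstones and the other WAL record types.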

bboreham avatar Apr 23 '23 15:04 bboreham