Investigating the reason out of order samples are appearing in LTS blocks

Open shybbko opened this issue 2 years ago • 2 comments

I'm running two GKE Cortex clusters with identical configuration (Cortex 1.11, GKE 1.21). They aggregate metrics from several sources; think of them as a nonprod and a prod cluster. Each cluster holds several TB of metrics. In the nonprod cluster I have 0 blocks with out-of-order series. In the prod cluster there are over 20.

I only noticed this when those blocks were skipped during compaction because they contain OoO samples.

So far the out-of-order blocks are "grouped" around seven timestamps between January and March. The clusters have been online for much longer, but between December and January I was migrating from chunks to blocks (not sure whether that is relevant). The faulty blocks are either "even" (e.g. 4 to 6 pm) or not (e.g. 4:01:13 to 6 pm). I don't recall any particular events taking place around those timestamps. Each faulty block contains between 1 and 13 out-of-order series. I also don't see any obvious correlation between the OoO samples (such as job or source node); they appear random.
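To make "out-of-order series" concrete, here is a minimal Python sketch of the idea at the sample level. It assumes a simplified definition: a sample is out of order when its timestamp is not strictly greater than the last accepted timestamp of the same series. The names `Sample` and `find_out_of_order` are hypothetical and used only for illustration; the actual check that makes the compactor skip a block works on the chunks in the block index and may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical sample shape for illustration: (series_id, timestamp_ms, value).
Sample = Tuple[str, int, float]


def find_out_of_order(samples: Iterable[Sample]) -> Dict[str, List[int]]:
    """Return, per series, the timestamps that arrive out of order.

    Simplifying assumption: a sample is out of order when its timestamp is
    less than or equal to the last accepted timestamp of the same series.
    """
    last_ts: Dict[str, int] = {}
    offenders: Dict[str, List[int]] = defaultdict(list)
    for series_id, ts, _value in samples:
        prev = last_ts.get(series_id)
        if prev is not None and ts <= prev:
            # Not strictly newer than the last accepted sample: flag it.
            offenders[series_id].append(ts)
        else:
            # In order: this becomes the new reference point for the series.
            last_ts[series_id] = ts
    return dict(offenders)


if __name__ == "__main__":
    demo = [
        ("a", 1, 0.0),
        ("a", 3, 0.0),
        ("a", 2, 0.0),  # older than the previous sample of series "a"
        ("b", 5, 1.0),
        ("b", 6, 1.0),
        ("a", 4, 0.0),
    ]
    print(find_out_of_order(demo))
```

Only series "a" is reported here, which mirrors the situation above where a handful of series in an otherwise healthy block are affected.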

I wanted to investigate the reasons for:

  • the blocks with OoO series appearing at all
  • the blocks appearing in the "prod" cluster only

So far I have come up with three possible explanations, but I can neither confirm nor rule out any of them:

  • Prometheus race condition issue? https://github.com/prometheus/prometheus/issues/9879 + https://github.com/cortexproject/cortex/issues/4573
  • Cortex is rejecting some out-of-order samples on a daily basis, and IMO the number is significant (at least higher than I would like it to be). While Cortex is designed not to let through any OoO sample, maybe one occasionally slips through? In prod I've got about 500M OoO samples rejected in total (around 0.02% of traffic); in nonprod it's 5M in total (around 0.0001%).
  • An unclean deployment / rollout in K8s. By "unclean" I mean that Cortex ingesters might be getting forcibly shut down and writing metrics into LTS in an improper manner.

Any ideas, hints, suggestions? Ideally I want to ensure that no new block contains out-of-order series in the future.

shybbko avatar Apr 25 '22 15:04 shybbko

We sometimes see out-of-order samples as well, and it's not easy to find the root cause.

We are currently testing https://github.com/prometheus/prometheus/pull/10624 to see if it fixes the issue in our case.

alanprot avatar May 04 '22 22:05 alanprot

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 12 '22 01:08 stale[bot]