Flaky `TestRangeVectorSelectors`
https://github.com/grafana/mimir/actions/runs/10219303903/job/28277295499#step:8:76
--- FAIL: TestRangeVectorSelectors (0.07s)
--- FAIL: TestRangeVectorSelectors/histogram:_metric_with_stale_marker (0.00s)
--- FAIL: TestRangeVectorSelectors/histogram:_metric_with_stale_marker/Prometheus'_engine (0.00s)
histogram.go:115:
Error Trace: /__w/mimir/mimir/pkg/util/test/histogram.go:115
/__w/mimir/mimir/pkg/streamingpromql/engine_test.go:602
/__w/mimir/mimir/pkg/streamingpromql/engine_test.go:614
Error: Not equal:
expected: &histogram.FloatHistogram{CounterResetHint:0x0, Schema:0, ZeroThreshold:0, ZeroCount:0, Count:1, Sum:1, PositiveSpans:[]histogram.Span{histogram.Span{Offset:0, Length:0x2}}, NegativeSpans:[]histogram.Span(nil), PositiveBuckets:[]float64{1, 0}, NegativeBuckets:[]float64(nil), CustomValues:[]float64(nil)}
actual : &histogram.FloatHistogram{CounterResetHint:0x0, Schema:0, ZeroThreshold:0, ZeroCount:0, Count:1, Sum:1, PositiveSpans:[]histogram.Span{histogram.Span{Offset:0, Length:0x4}}, NegativeSpans:[]histogram.Span(nil), PositiveBuckets:[]float64{1, 0}, NegativeBuckets:[]float64(nil), CustomValues:[]float64(nil)}
Diff:
--- Expected
+++ Actual
@@ -10,3 +10,3 @@
Offset: (int32) 0,
- Length: (uint32) 2
+ Length: (uint32) 4
}
Test: TestRangeVectorSelectors/histogram:_metric_with_stale_marker/Prometheus'_engine
Messages: []
FAIL
This appears to be a bug in native histogram handling in Prometheus' PromQL engine. Given it only fails occasionally, I suspect a memory pooling issue somewhere.
I can reliably reproduce this failure locally with the following command in the `pkg/streamingpromql` directory: `go test -run="TestRangeVectorSelectors|TestOurTestCases" -count=100`
The failure occurs even if I disable all use of MQE by commenting it out in both `TestRangeVectorSelectors` and `TestOurTestCases`, and comment out all test cases in `TestRangeVectorSelectors` except the failing one mentioned above (`histogram: metric with stale marker`).
@krajorama would you be able to investigate this further?
I checked the stale marker: the test inserts a float `NaN` as the stale marker, not a histogram with `sum == NaN`. (This is valid.)
Loading 0 {count:1, sum:1, (0.5,1]:1}
Loading 60000 {count:2, sum:2, (0.5,1]:1, (1,2]:1}
Loading 120000 NaN
Loading 180000 {count:4, sum:4, (0.5,1]:1, (1,2]:1, (2,4]:1, (4,8]:1}
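For context, Prometheus represents a stale marker as a float sample carrying a specific NaN bit pattern, which is distinct from an ordinary NaN such as a histogram sum of NaN. A minimal sketch of the distinction using Prometheus' `model/value` package:

```go
package main

import (
	"fmt"
	"math"

	"github.com/prometheus/prometheus/model/value"
)

func main() {
	// A stale marker is a float with a dedicated NaN bit pattern.
	fmt.Println(value.IsStaleNaN(value.StaleNaN)) // true

	// An ordinary NaN (e.g. a histogram Sum of NaN) is not a stale marker.
	fmt.Println(value.IsStaleNaN(math.NaN())) // false
}
```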
Also of note: the histogram at minute 3 (180000) has 4 buckets, while the one at minute 1 (60000) has 2. According to the failure output, that is exactly the overwrite we're getting, as if the later histogram's spans overwrote those of the first histogram.
Seems like pointer reuse in the failing case:
Running test histogram: metric with stale marker
matrixIterSlice mint: 0 maxt: 180000
Loop floathistogram at 0
histogram #0 {count:1, sum:1, (0.5,1]:1} at 0 ptr=0xc00053a3c0
Loop floathistogram at 60000
histogram #1 {count:2, sum:2, (0.5,1]:1, (1,2]:1} at 60000 ptr=0xc000800000
Loop floathistogram at 120000
histogram #2 {count:0, sum:NaN} at 120000 ptr=0xc000800320
Loop none at 120000
soughtValueType floathistogram at 180000
histogram #2 {count:4, sum:4, (0.5,1]:1, (1,2]:1, (2,4]:1, (4,8]:1} at 180000 ptr=0xc000800320
Note that `0xc000800320` is the same `*histogram.FloatHistogram` pointer in both cases.
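A minimal sketch of how this kind of aliasing can arise when a reused points buffer is shrunk with `copy` and then re-grown in place. The types here are simplified stand-ins (`HPoint` only mirrors the shape of `promql.HPoint`), and the reuse pattern is an assumption for illustration, not the actual engine code:

```go
package main

import "fmt"

// FloatHist stands in for histogram.FloatHistogram (simplified).
type FloatHist struct{ SpanLen uint32 }

// HPoint mirrors the shape of promql.HPoint: a timestamp plus a histogram pointer.
type HPoint struct {
	T int64
	H *FloatHist
}

func main() {
	// Reused buffer holding two points from a previous evaluation window.
	buf := []HPoint{
		{T: 0, H: &FloatHist{SpanLen: 2}},     // now falls before mint
		{T: 60000, H: &FloatHist{SpanLen: 2}}, // still inside the window
	}

	// Drop the point before mint: shift live points left and truncate.
	copy(buf, buf[1:])
	buf = buf[:1]
	// The backing array's slot 1 still holds the old struct value,
	// so its H pointer now aliases buf[0].H.

	// Append a new point by re-growing and reusing the slot in place,
	// writing into the existing non-nil H instead of allocating.
	buf = buf[:2]
	buf[1].T = 180000
	buf[1].H.SpanLen = 4 // also mutates buf[0].H: same pointer!

	fmt.Println(buf[0].H.SpanLen) // prints 4, not 2
}
```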
Adding `histograms[n].H = nil` after the lines at
https://github.com/prometheus/prometheus/blob/e92a18b6ce295b99e1c089d6e2bc62fbe3f73869/promql/engine.go#L2317
and
https://github.com/prometheus/prometheus/blob/e92a18b6ce295b99e1c089d6e2bc62fbe3f73869/promql/engine.go#L2360
fixes it, but I'm not sure I understand why yet.
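To illustrate why clearing the pointer can help, here is the same simplified sketch as above with the stale tail pointer nil-ed after truncation. This mirrors the intent of the `histograms[n].H = nil` change, not the exact engine.go code:

```go
package main

import "fmt"

type FloatHist struct{ SpanLen uint32 }

type HPoint struct {
	T int64
	H *FloatHist
}

func main() {
	buf := []HPoint{
		{T: 0, H: &FloatHist{SpanLen: 2}},
		{T: 60000, H: &FloatHist{SpanLen: 2}},
	}

	// Drop the point before mint as before, but clear the stale H
	// pointer left behind in the backing array.
	copy(buf, buf[1:])
	buf = buf[:1]
	buf[:2][1].H = nil // break the alias

	// Reusing the slot now allocates a fresh histogram instead of
	// writing through the aliased pointer.
	buf = buf[:2]
	if buf[1].H == nil {
		buf[1].H = &FloatHist{}
	}
	buf[1].T = 180000
	buf[1].H.SpanLen = 4

	fmt.Println(buf[0].H.SpanLen) // prints 2: the live histogram is intact
}
```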
Mimir should get the fix via https://github.com/grafana/mimir/pull/8938; please retest after it lands.
I cannot reproduce this in the latest Mimir main.