Flaky `TestRangeVectorSelectors`
https://github.com/grafana/mimir/actions/runs/10219303903/job/28277295499#step:8:76
--- FAIL: TestRangeVectorSelectors (0.07s)
--- FAIL: TestRangeVectorSelectors/histogram:_metric_with_stale_marker (0.00s)
--- FAIL: TestRangeVectorSelectors/histogram:_metric_with_stale_marker/Prometheus'_engine (0.00s)
histogram.go:115:
Error Trace: /__w/mimir/mimir/pkg/util/test/histogram.go:115
/__w/mimir/mimir/pkg/streamingpromql/engine_test.go:602
/__w/mimir/mimir/pkg/streamingpromql/engine_test.go:614
Error: Not equal:
expected: &histogram.FloatHistogram{CounterResetHint:0x0, Schema:0, ZeroThreshold:0, ZeroCount:0, Count:1, Sum:1, PositiveSpans:[]histogram.Span{histogram.Span{Offset:0, Length:0x2}}, NegativeSpans:[]histogram.Span(nil), PositiveBuckets:[]float64{1, 0}, NegativeBuckets:[]float64(nil), CustomValues:[]float64(nil)}
actual : &histogram.FloatHistogram{CounterResetHint:0x0, Schema:0, ZeroThreshold:0, ZeroCount:0, Count:1, Sum:1, PositiveSpans:[]histogram.Span{histogram.Span{Offset:0, Length:0x4}}, NegativeSpans:[]histogram.Span(nil), PositiveBuckets:[]float64{1, 0}, NegativeBuckets:[]float64(nil), CustomValues:[]float64(nil)}
Diff:
--- Expected
+++ Actual
@@ -10,3 +10,3 @@
Offset: (int32) 0,
- Length: (uint32) 2
+ Length: (uint32) 4
}
Test: TestRangeVectorSelectors/histogram:_metric_with_stale_marker/Prometheus'_engine
Messages: []
FAIL
This appears to be a bug in native histogram handling in Prometheus' PromQL engine. Given it only fails occasionally, I suspect a memory pooling issue somewhere.
I can reliably reproduce this failure locally with the following command in the `pkg/streamingpromql` directory: `go test -run="TestRangeVectorSelectors|TestOurTestCases" -count=100`
The failure occurs even if I disable all use of MQE by commenting it out in both `TestRangeVectorSelectors` and `TestOurTestCases`, and comment out all test cases in `TestRangeVectorSelectors` except the failing one mentioned above (`histogram: metric with stale marker`).
@krajorama would you be able to investigate this further?
I checked the stale marker: the test inserts a float `NaN` as the stale marker, not a histogram with `sum == NaN`. (This is valid.)
Loading 0 {count:1, sum:1, (0.5,1]:1}
Loading 60000 {count:2, sum:2, (0.5,1]:1, (1,2]:1}
Loading 120000 NaN
Loading 180000 {count:4, sum:4, (0.5,1]:1, (1,2]:1, (2,4]:1, (4,8]:1}
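For context, Prometheus represents a stale marker as a float sample carrying a specific NaN bit pattern, which is distinct from an ordinary NaN such as a histogram sum of NaN. A minimal sketch of the distinction using Prometheus' `model/value` package:

```go
package main

import (
	"fmt"
	"math"

	"github.com/prometheus/prometheus/model/value"
)

func main() {
	// A stale marker is a float with a dedicated NaN bit pattern.
	fmt.Println(value.IsStaleNaN(value.StaleNaN)) // true

	// An ordinary NaN (e.g. a histogram Sum of NaN) is not a stale marker.
	fmt.Println(value.IsStaleNaN(math.NaN())) // false
}
```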
Also of note: the histogram at minute 3 (180000) has 4 buckets, while the one at minute 1 (60000) has 2. According to the failure output, that is exactly the overwrite we're getting, as if the later histogram's spans overwrote those of the first histogram.
Seems like pointer reuse in the failing case:
Running test histogram: metric with stale marker
matrixIterSlice mint: 0 maxt: 180000
Loop floathistogram at 0
histogram #0 {count:1, sum:1, (0.5,1]:1} at 0 ptr=0xc00053a3c0
Loop floathistogram at 60000
histogram #1 {count:2, sum:2, (0.5,1]:1, (1,2]:1} at 60000 ptr=0xc000800000
Loop floathistogram at 120000
histogram #2 {count:0, sum:NaN} at 120000 ptr=0xc000800320
Loop none at 120000
soughtValueType floathistogram at 180000
histogram #2 {count:4, sum:4, (0.5,1]:1, (1,2]:1, (2,4]:1, (4,8]:1} at 180000 ptr=0xc000800320
Note that `0xc000800320` is the same `*histogram.FloatHistogram` pointer in both cases.
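A minimal sketch of how this kind of aliasing can arise when a reused points buffer is shrunk with `copy` and then re-grown in place. The types here are simplified stand-ins (`HPoint` only mirrors the shape of `promql.HPoint`), and the reuse pattern is an assumption for illustration, not the actual engine code:

```go
package main

import "fmt"

// FloatHist stands in for histogram.FloatHistogram (simplified).
type FloatHist struct{ SpanLen uint32 }

// HPoint mirrors the shape of promql.HPoint: a timestamp plus a histogram pointer.
type HPoint struct {
	T int64
	H *FloatHist
}

func main() {
	// Reused buffer holding two points from a previous evaluation window.
	buf := []HPoint{
		{T: 0, H: &FloatHist{SpanLen: 2}},     // now falls before mint
		{T: 60000, H: &FloatHist{SpanLen: 2}}, // still inside the window
	}

	// Drop the point before mint: shift live points left and truncate.
	copy(buf, buf[1:])
	buf = buf[:1]
	// The backing array's slot 1 still holds the old struct value,
	// so its H pointer now aliases buf[0].H.

	// Append a new point by re-growing and reusing the slot in place,
	// writing into the existing non-nil H instead of allocating.
	buf = buf[:2]
	buf[1].T = 180000
	buf[1].H.SpanLen = 4 // also mutates buf[0].H: same pointer!

	fmt.Println(buf[0].H.SpanLen) // prints 4, not 2
}
```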
Adding `histograms[n].H = nil` after the lines at
https://github.com/prometheus/prometheus/blob/e92a18b6ce295b99e1c089d6e2bc62fbe3f73869/promql/engine.go#L2317
and
https://github.com/prometheus/prometheus/blob/e92a18b6ce295b99e1c089d6e2bc62fbe3f73869/promql/engine.go#L2360
fixes it, but I'm not sure I understand why yet.
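To illustrate why clearing the pointer can help, here is the same simplified sketch as above with the stale tail pointer nil-ed after truncation. This mirrors the intent of the `histograms[n].H = nil` change, not the exact engine.go code:

```go
package main

import "fmt"

type FloatHist struct{ SpanLen uint32 }

type HPoint struct {
	T int64
	H *FloatHist
}

func main() {
	buf := []HPoint{
		{T: 0, H: &FloatHist{SpanLen: 2}},
		{T: 60000, H: &FloatHist{SpanLen: 2}},
	}

	// Drop the point before mint as before, but clear the stale H
	// pointer left behind in the backing array.
	copy(buf, buf[1:])
	buf = buf[:1]
	buf[:2][1].H = nil // break the alias

	// Reusing the slot now allocates a fresh histogram instead of
	// writing through the aliased pointer.
	buf = buf[:2]
	if buf[1].H == nil {
		buf[1].H = &FloatHist{}
	}
	buf[1].T = 180000
	buf[1].H.SpanLen = 4

	fmt.Println(buf[0].H.SpanLen) // prints 2: the live histogram is intact
}
```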
Mimir should get the fix via https://github.com/grafana/mimir/pull/8938; please retest after it lands.
I cannot reproduce this in the latest Mimir main.