
Test flake: TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling

seizethedave opened this issue 9 months ago · 6 comments

Describe the bug

This test failed for me in GitHub CI and succeeded on rerun. I was working on something unrelated in a branch whose parent commit on main was 5e93125923e94718d9ec1673fae0d3d6a65e0686.

```
19:19:22 Started distributor
=== RUN   TestDistributor/caching_unmarshal_data_disabled/wrong_labels_order
=== RUN   TestDistributor/caching_unmarshal_data_disabled/dropped_labels
19:19:23 querier: ts=2024-05-06T19:19:23.811704501Z caller=loader.go:120 level=warn msg="bucket index not found" user=e2e-user
=== RUN   TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
    distributor_test.go:361: 
        	Error Trace:	/home/runner/work/mimir/mimir/integration/distributor_test.go:361
        	Error:      	Not equal: 
        	            	expected: "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[1715019559]"
        	            	actual  : "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	 histogram_down_scaling_series =>
        	            	-Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[1715019559]
        	            	+Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]
        	Test:       	TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
```
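
Editorial aside, not part of the original report: native histogram bucket boundaries sit at powers of the base 2^(2^-schema), so reducing the schema by one squares the base and merges pairs of adjacent buckets. The expected output above uses base 4 (boundaries ... 0.25, 1, 4, 16), i.e. the down-scaled schema, while the actual output still uses base 2 (... 0.5, 1, 2, 4, 8, 16), i.e. the down-scaling was never applied. A minimal sketch of the boundary formula:

```go
package main

import (
	"fmt"
	"math"
)

// upperBound returns the upper boundary of bucket index i for a native
// histogram with the given schema: base^i, where base = 2^(2^-schema).
func upperBound(schema, i int) float64 {
	base := math.Pow(2, math.Pow(2, float64(-schema)))
	return math.Pow(base, float64(i))
}

func main() {
	for _, schema := range []int{0, -1} {
		fmt.Printf("schema %d boundaries:", schema)
		for i := 1; i <= 4; i++ {
			fmt.Printf(" %g", upperBound(schema, i))
		}
		fmt.Println()
	}
	// Output:
	// schema 0 boundaries: 2 4 8 16      (base 2, as in the "actual" buckets)
	// schema -1 boundaries: 4 16 64 256  (base 4, as in the "expected" buckets)
}
```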

seizethedave · May 06 '24 20:05

The code that does the down-scaling is deterministic, so the only thing I can think of is that the loading of the runtime config may go wrong. We store the config first for serving on the URL and only then notify the modules that use it; I wonder if that means there's a short window where the updated config can be fetched from the HTTP interface but is not yet active? https://github.com/grafana/dskit/blob/a1bba1277f06b41ff14cf54e5d4d31aebc92493b/runtimeconfig/manager.go#L209
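
A minimal paraphrase of the ordering at the linked line (a sketch for illustration, not the verbatim dskit source; types are simplified): the freshly loaded config is stored first, which is what the HTTP interface serves, and the listener modules are only notified afterwards, so a reader of the HTTP endpoint can observe a config that no module has applied yet.

```go
package runtimeconfig

import "sync"

// Manager is a simplified stand-in for dskit's runtimeconfig.Manager.
type Manager struct {
	mtx       sync.Mutex
	config    interface{}
	listeners []chan interface{}
}

// setConfig stores the config; from this point on the HTTP handler
// serving the runtime config returns the new value.
func (m *Manager) setConfig(cfg interface{}) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.config = cfg
}

// callListeners notifies the modules, which apply cfg asynchronously
// in their own goroutines.
func (m *Manager) callListeners(cfg interface{}) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	for _, ch := range m.listeners {
		ch <- cfg // simplified: the real code uses a non-blocking send
	}
}

// loadConfig mirrors the ordering in question: the config becomes visible
// over HTTP (step 1) before any module is told about it (step 2).
func (m *Manager) loadConfig(cfg interface{}) {
	m.setConfig(cfg)     // 1. visible on the HTTP interface
	m.callListeners(cfg) // 2. modules learn about it only now
}
```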

krajorama · May 10 '24 08:05

Reproduced with this diff, which widens the window between storing the config (visible via the HTTP interface) and notifying the listeners:

```diff
diff --git a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
index 84b69de76..e43c136a5 100644
--- a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
+++ b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
@@ -206,6 +206,7 @@ func (om *Manager) loadConfig() error {
        }
        om.configLoadSuccess.Set(1)
 
+       time.Sleep(1 * time.Second) // give listeners time to register
        om.setConfig(cfg)
        om.callListeners(cfg)
```

krajorama · May 10 '24 09:05

> We store the config first for serving on the URL and only then notify the modules that use it; I wonder if that means there's a short window where the updated config can be fetched from the HTTP interface but is not yet active?

~~I think you are onto it. I see two possibilities:~~

  1. ~~The single goroutine per module isn't done processing the updated config.~~
  2. ~~(a further symptom of number 1) If the reader side is behind on processing updates, the write side will discard an update.~~

Either of these would cause inconsistent updates among listeners, but I don't think they would make the tests fail, because I don't think the listeners are in the "update path" of runtime config values like native_histograms_ingestion_enabled. (The second possibility is sketched below.)
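
A self-contained sketch of possibility 2 from the list above (a common Go pattern; assumed to match dskit's per-listener buffered channels, not copied from it): updates are delivered with a non-blocking send, so if a listener hasn't drained the previous update, the newer one is silently discarded and listeners can end up on different configs.

```go
package main

import "fmt"

func main() {
	listener := make(chan string, 1) // one buffered slot per listener

	// notify delivers cfg without blocking; if the listener is still
	// busy with the previous update, the new one is discarded.
	notify := func(cfg string) {
		select {
		case listener <- cfg:
			fmt.Println("delivered:", cfg)
		default:
			fmt.Println("listener behind, dropped:", cfg)
		}
	}

	notify("config-v1")
	notify("config-v2") // dropped: v1 still sits in the buffer
	fmt.Println("listener applies:", <-listener)
	// Output:
	// delivered: config-v1
	// listener behind, dropped: config-v2
	// listener applies: config-v1
}
```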

seizethedave · May 13 '24 16:05

{waking up on this Monday morning 😸.}

But the race you highlighted with that second of sleep would also cause it.

seizethedave · May 13 '24 16:05

I guess your one-second sleep just takes longer than the tests are willing to wait for the new config to become available. One second, while an eternity, could still be a source of flakes. I'll open a PR to crank the wait time up to something like ten seconds.

seizethedave · May 13 '24 17:05

Actually, I don't think that was the problem. Here's the run that failed: https://github.com/grafana/mimir/actions/runs/8974460472/job/24647018591

test.Poll will t.Fatal if the condition isn't met within the 1 second, and that didn't happen on that run.
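
For readers of the thread, the polling pattern in question looks roughly like this (the signature follows my reading of dskit's test.Poll; waitForRuntimeConfig and its condition function are hypothetical): Poll re-evaluates the condition until it matches or the deadline passes, and fails the test on timeout. So if the injected delay had simply made the config late, the run would have died inside Poll rather than reaching the bucket assertion at distributor_test.go:361.

```go
package integration

import (
	"testing"
	"time"

	"github.com/grafana/dskit/test"
)

// waitForRuntimeConfig blocks until configActive() returns true, failing
// the test via t.Fatal if that doesn't happen within one second.
// configActive is a hypothetical condition checking that the distributor
// reports the updated runtime config.
func waitForRuntimeConfig(t *testing.T, configActive func() interface{}) {
	test.Poll(t, 1*time.Second, true, configActive)
}
```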

seizethedave · May 13 '24 17:05