Test flake: TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
Describe the bug
This test failed for me in GitHub CI and succeeded on rerun. I was working on something unrelated in a branch whose parent commit on main was 5e93125923e94718d9ec1673fae0d3d6a65e0686.
19:19:22 Started distributor
=== RUN TestDistributor/caching_unmarshal_data_disabled/wrong_labels_order
=== RUN TestDistributor/caching_unmarshal_data_disabled/dropped_labels
19:19:23 querier: ts=2024-05-06T19:19:23.811704501Z caller=loader.go:120 level=warn msg="bucket index not found" user=e2e-user
=== RUN TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
distributor_test.go:361:
Error Trace: /home/runner/work/mimir/mimir/integration/distributor_test.go:361
Error: Not equal:
expected: "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[[171](https://github.com/grafana/mimir/actions/runs/8974460472/job/24647018591?pr=8068#step:15:172)5019559]"
actual : "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]"
Diff:
--- Expected
+++ Actual
@@ -1,2 +1,2 @@
histogram_down_scaling_series =>
-Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[1715019559]
+Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]
Test: TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
The code that does the down-scaling is deterministic, so the only thing I can think of is that loading the runtime config may go wrong. I wonder if the fact that we store the config first (for serving it over the URL) and only then notify the modules that use it means there's a short window where the updated config is visible on the HTTP interface but it's not yet active? https://github.com/grafana/dskit/blob/a1bba1277f06b41ff14cf54e5d4d31aebc92493b/runtimeconfig/manager.go#L209
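To make the suspected window concrete, here is a self-contained Go sketch of the ordering I mean. This is not the actual dskit code; the types and names are simplified stand-ins. The point is that the value the HTTP endpoint serves is stored before the listeners that actually apply it are called, so a poll on the endpoint can observe the new config while a module still runs with the old one:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// manager is a stand-in for the runtime config manager: "stored" is what the
// HTTP endpoint would serve, "applied" is what a listening module acts on.
type manager struct {
	mu      sync.Mutex
	stored  int
	applied int
}

func (m *manager) setConfig(v int) { m.mu.Lock(); m.stored = v; m.mu.Unlock() }

func (m *manager) getConfig() int { m.mu.Lock(); defer m.mu.Unlock(); return m.stored }

// callListeners applies the value after a small delay, standing in for the
// listener goroutine not having processed the update yet.
func (m *manager) callListeners(v int) {
	time.Sleep(100 * time.Millisecond)
	m.mu.Lock()
	m.applied = v
	m.mu.Unlock()
}

func main() {
	m := &manager{}

	go func() {
		m.setConfig(42)     // 1) stored: the HTTP interface now serves the new value
		m.callListeners(42) // 2) only now does the module apply it
	}()

	// A poll on the HTTP-visible value succeeds as soon as step 1 has run...
	for m.getConfig() != 42 {
		time.Sleep(time.Millisecond)
	}

	// ...but the behaviour the test asserts next may still use the old config.
	m.mu.Lock()
	fmt.Println("endpoint shows 42, module has applied:", m.applied)
	m.mu.Unlock()
}
```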
Reproduced with this diff:
diff --git a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
index 84b69de76..e43c136a5 100644
--- a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
+++ b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
@@ -206,6 +206,7 @@ func (om *Manager) loadConfig() error {
 	}
 	om.configLoadSuccess.Set(1)
+	time.Sleep(1 * time.Second) // give listeners time to register
 	om.setConfig(cfg)
 	om.callListeners(cfg)
> I wonder if the fact that we store the config first (for serving it over the URL) and only then notify the modules that use it means there's a short window where the updated config is visible on the HTTP interface but it's not yet active?
~I think you are onto it. I see two possibilities:~
- ~The single goroutine per module isn't done processing the updated config.~
- ~(a further symptom of number 1) If the reader side is behind on processing updates, the write side will discard an update.~
These would cause inconsistent updates among listeners, but I don't think they would cause the tests to fail, because I don't think the listeners are in the "update path" of runtime config values like `native_histograms_ingestion_enabled`.
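For context on why I don't think the listener fan-out matters for these values: as far as I can tell, per-tenant settings like `native_histograms_ingestion_enabled` are resolved at request time through an overrides-style getter that reads whatever config the manager currently has stored, not through a listener callback. The sketch below is my mental model only; the type and method names are simplified stand-ins, not the real Mimir code:

```go
// Simplified mental model only; these are not the real Mimir types.
package limits

type runtimeConfig struct {
	NativeHistogramsIngestionEnabled bool
}

// configGetter stands in for whatever exposes the manager's currently stored config.
type configGetter interface {
	GetConfig() *runtimeConfig
}

type overrides struct {
	getter configGetter
}

// NativeHistogramsIngestionEnabled is looked up per request: callers see the new
// value as soon as the manager has stored it, independent of listener callbacks.
func (o *overrides) NativeHistogramsIngestionEnabled(tenantID string) bool {
	if cfg := o.getter.GetConfig(); cfg != nil {
		return cfg.NativeHistogramsIngestionEnabled
	}
	return false
}
```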
(Waking up on this Monday morning 😸.)
But the race you highlighted, reproduced with the second of sleep, would also cause it.
I guess your added second of sleep simply takes longer than the tests are willing to wait for the new config to become available. One second, while an eternity in machine time, could still be a source of flakes. I'll open a PR to crank the wait time up to something like ten seconds.
Actually, I don't think that was the problem. Here's the run that failed: https://github.com/grafana/mimir/actions/runs/8974460472/job/24647018591
`test.Poll` will call `t.Fatal` if the condition isn't met within the 1-second timeout, and that didn't happen on that run.
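For reference, this is roughly how `test.Poll` behaves (paraphrased from memory, not the verbatim dskit source): it keeps re-checking the condition until the timeout and only calls `t.Fatal` if the final check still doesn't match, so a run that didn't hit `t.Fatal` must have seen the expected value within the window.

```go
// Rough paraphrase of test.Poll's behaviour; not the verbatim dskit source.
package test

import (
	"reflect"
	"testing"
	"time"
)

// Poll re-evaluates have() until it equals want or the timeout d expires; only
// then does it fail the test.
func Poll(t testing.TB, d time.Duration, want interface{}, have func() interface{}) {
	t.Helper()
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		if reflect.DeepEqual(want, have()) {
			return // condition met within the timeout: no failure
		}
		time.Sleep(d / 100)
	}
	// Only after the timeout expires without a match does the test die.
	if got := have(); !reflect.DeepEqual(want, got) {
		t.Fatalf("expected %v, got %v", want, got)
	}
}
```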