prometheus icon indicating copy to clipboard operation
prometheus copied to clipboard

Flaky TestAsyncRuleEvaluation

Open bboreham opened this issue 6 months ago • 5 comments

Failed here: https://github.com/prometheus/prometheus/actions/runs/7797756253/job/21265037848?pr=13545

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
    manager_test.go:1961: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1961
        	Error:      	"0.0655291" is not less than "0.06"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
--- FAIL: TestAsyncRuleEvaluation (0.18s)
    --- FAIL: TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency (0.07s)

Passed on re-run: https://github.com/prometheus/prometheus/actions/runs/7797756253/job/21265977651?pr=13545

Also failed here: https://github.com/prometheus/prometheus/actions/runs/7725797136/job/21060788152?pr=13197

bboreham avatar Feb 06 '24 10:02 bboreham

I think creating more rules for evaluation in fixtures/rules_multiple_independent.yaml might fix the flakeyness, since the time saved with concurrent evaluations would become more apparent. wdyt?

darshanime avatar Feb 06 '24 12:02 darshanime

I haven't looked at the test at all. Since it's only flaking on Windows, could it be the timer resolution again?

bboreham avatar Feb 06 '24 16:02 bboreham

TIL about windows timer resolution being ~15ms!

My hypothesis is that the concurrency does not get to "kick in" with just 6 rules, and so if we increase the number of rules to evaluate, the time savings due to concurrency will get more marked.

check https://github.com/prometheus/prometheus/pull/13553

darshanime avatar Feb 07 '24 10:02 darshanime

I just got this on my desktop Linux VM:

$ go test -tags stringlabels ./rules/
    --- FAIL: TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules (0.06s)
        manager_test.go:2391: 
                Error Trace:    /home/vagrant/src/github.com/grafana/mimir-prometheus/rules/manager_test.go:2391
                Error:          "0.056689165" is not less than "0.04"
                Test:           TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules

bboreham avatar Feb 07 '24 15:02 bboreham

Huh, that test is flakey too? I think the same technique will help there as well. Updated

darshanime avatar Feb 07 '24 16:02 darshanime

Flaked on https://github.com/prometheus/prometheus/actions/runs/8080900355/job/22078272296?pr=13662 (windows)

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules
    manager_test.go:1926: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1926
        	Error:      	"0.0411435" is not less than "0.04"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules

and

https://github.com/prometheus/prometheus/actions/runs/8153560332/job/22285206086?pr=13706 (windows)

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_sufficient_concurrency
    manager_test.go:2000: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:2000
        	Error:      	"0.06309" is not less than "0.06"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_sufficient_concurrency

zenador avatar Feb 28 '24 13:02 zenador

Looking at the code again, it seems to rely on a 10ms Sleep being reasonably close to 10ms. https://github.com/prometheus/prometheus/blob/3b8b57700c469c7cde84e1d8f9d383cb8fe11ab0/rules/manager_test.go#L2075 https://github.com/prometheus/prometheus/blob/3b8b57700c469c7cde84e1d8f9d383cb8fe11ab0/rules/manager_test.go#L1926

This assumption is not safe: https://github.com/golang/go/issues/44343 (but will be safer in Go 1.23).

The timings in flake reports above don't match neatly to the 15ms value on Windows, so I guess something else is going wrong.

bboreham avatar May 13 '24 16:05 bboreham

Another example: https://github.com/prometheus/prometheus/actions/runs/9076113775/job/24938133905?pr=14098#step:4:13802

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
    manager_test.go:1962: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1962
        	Error:      	"0.07915" is not less than "0.06"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency

zenador avatar May 14 '24 11:05 zenador

Unfortunately we're still seeing this on main

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
    manager_test.go:1963: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1963
        	Error:      	"0.8259941" is not less than "0.09"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_sufficient_concurrency
--- FAIL: TestAsyncRuleEvaluation (1.01s)

https://github.com/prometheus/prometheus/actions/runs/9178564530/job/25238754426#step:4:14073

machine424 avatar May 21 '24 17:05 machine424

https://github.com/prometheus/prometheus/pull/14300 was merged for this.

machine424 avatar Jun 16 '24 14:06 machine424