☂ Fix flaky tests
Some tests just randomly fail in the CI.
Known bad with last confirmation date:
- [ ] `polkadot-node-core-pvf::it execute_queue_doesnt_stall_with_varying_executor_params` 2025.11.11
- [ ] `follow_report_multiple_pruned_block` https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2540531
- [ ] `follow_forks_pruned_block` https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2529507
- [x] `returns_status_for_pruned_blocks` https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2522151
- [ ] `can_sync_small_non_best_forks` https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2635997
- [ ] `telemetry_works` on commit 890451221db37176e13cb1a306246f02de80590a: `assertion failed: self.wait().unwrap().success()` (see comment)
- [x] `beefy_reports_equivocations` CI 2023-06-14 - fix https://github.com/paritytech/substrate/pull/14382
- [ ] `response_headers_invalid_call` CI 2023-05-29
- [x] `ensure_parallel_execution`: `Expected duration 6697ms to be less than 6000ms` (2023-06-13) - fix paritytech/polkadot#7390
- [x] `execute_queue_doesnt_stall_with_varying_executor_params`: `Expected duration 12912ms to be less than or equal to 12000ms` (2023-06-13) - fix paritytech/polkadot#7390
- [ ] sc-offchain `api::http::tests::response_header_invalid_call` CI log (2023-06-16)
Maybe flaky, maybe fixed :ghost::
- [ ] `running_the_node_works_and_can_be_interrupted` error-log.txt 8a9f48bcf0c9f92949082535d77c12166522bb2f
- [ ] `notifications_back_pressure` https://github.com/paritytech/polkadot-sdk/issues/537
Fixed:
- [x] `temp_base_path_works` https://github.com/paritytech/substrate/pull/13505 error-log.txt
- [x] `subscribe_and_unsubscribe_to_justifications`
- [x] `syncs_header_only_forks` https://github.com/paritytech/substrate/issues/12607
- [x] babe `authoring_blocks` https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2101500 https://github.com/paritytech/substrate/pull/13199
Ok, I see. This might be quite tricky: finding "free" ports to use for libp2p. A first step would be to ensure that the CLI tests assign unique ports for libp2p.
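One common way to get a unique port in a test is to bind to port 0 and let the OS pick an ephemeral one. A minimal sketch (`free_port` is a hypothetical helper, not existing code; note that a port obtained this way can still be grabbed by another process before the node binds it, so it only reduces the race, it doesn't eliminate it):

```rust
use std::net::TcpListener;

// Hypothetical helper: ask the OS for a currently-free TCP port by binding
// to port 0, then read back the port the OS assigned.
fn free_port() -> u16 {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind to ephemeral port");
    listener.local_addr().expect("local addr").port()
    // The listener is dropped here, freeing the port for the test to reuse.
}

fn main() {
    let a = free_port();
    let b = free_port();
    println!("ports: {a}, {b}");
}
```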
The babe test `authoring_blocks` failed again:

```
tests::authoring_blocks' panicked at 'importing block failed: ClientImport("Slot number must increase: parent slot: 1669811014, this slot: 1669811014")
```
`telemetry_works` seems to be another flaky test. When running:

```sh
while cargo test --release -p node-cli --test telemetry; do true; done
```

the test will fail at some point. When adding some more "debugging", the following error is shown:

```
[test-utils/cli/src/lib.rs:236] self.wait().unwrap() = ExitStatus(
    unix_wait_status(
        139,
    ),
)
```
This indicates that the spawned Substrate process is dying because of a segmentation fault. I assume the underlying problem is not related to telemetry, as it happens when shutting down the node. (Maybe it is still related to telemetry and only happens because the worker is doing something it shouldn't be doing.)
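The raw wait status 139 can be decoded with the standard library to confirm this reading: on unix, 139 encodes termination by signal 11 (SIGSEGV) with the core-dump bit set. A small sketch:

```rust
// Unix-only: decode a raw wait(2) status as reported by the test harness.
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

fn main() {
    // unix_wait_status(139): low 7 bits = 11 (SIGSEGV), bit 0x80 = core dumped.
    let status = ExitStatus::from_raw(139);
    assert_eq!(status.signal(), Some(11)); // killed by SIGSEGV
    assert!(!status.success());            // hence `success()` asserts fail
    println!("killed by signal {:?}", status.signal());
}
```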
> `telemetry_works` seems to be another flaky test.

Confirmed and added. Should we comment it out until fixed?
I haven't seen it in CI so far, only locally on my machine. Also, in debug builds it didn't seem to be reproducible, so maybe on slower machines it isn't a problem. I would like to keep it there until we have seen reports of it failing in CI.
Ping: Two more tests added: `beefy_reports_equivocations` and `response_headers_invalid_call`.
Not sure who to ping for `ensure_parallel_execution` and `execute_queue_doesnt_stall_with_varying_executor_params`.
Git shows that @mrcnski and @s0me0ne-unkn0wn have worked on the code; maybe one of you?
@ggwpez both tests are driven by calculated timeouts, which are flaky by nature; we just didn't expect CI runners to have such a significant divergence in performance :/
I'll look into it.
In this case it is not about CI runners; it failed on Gav's PC.
In general I don't know if we can assert timeouts without running on fixed hardware. Maybe just remove those checks? Or only run that last timing check when a CI env variable is present.
We could get rid of them, of course, but we would still want to check somehow that the queues are behaving as expected: that they run jobs in parallel rather than sequentially, and that they can kill workers and spawn new ones depending on conditions. That's hardly achievable without relying on timeouts. Only running them in CI sounds like a legit idea. Maybe limiting them to the `testnet` profile is enough?
> Maybe limiting them to the `testnet` profile is enough?
We don't compile tests with this profile. We could add some special env variable. However, looking at the test, could we not just spawn both invocations and check whether the test process has started two child processes (the workers)?
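A sketch of that idea, with stand-in workers (`sleep` here instead of the real PVF worker binary, which is an assumption for illustration): instead of asserting on elapsed time, verify that two children are alive at the same time.

```rust
use std::process::{Child, Command};

// Stand-in for spawning a real worker binary; `sleep 2` just keeps the
// child alive long enough to observe it running.
fn spawn_worker() -> Child {
    Command::new("sleep").arg("2").spawn().expect("spawn worker")
}

fn main() {
    let mut workers = vec![spawn_worker(), spawn_worker()];
    // Both children should still be running, i.e. executing concurrently,
    // without asserting anything about wall-clock durations.
    for w in &mut workers {
        assert!(w.try_wait().expect("try_wait").is_none(), "worker exited early");
    }
    for w in &mut workers {
        w.wait().expect("wait");
    }
    println!("two workers were alive in parallel");
}
```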
> We don't compile tests with this profile
I believe we do (at least for Polkadot): https://github.com/paritytech/polkadot/blob/master/scripts/ci/gitlab/pipeline/test.yml#L44
Even that does not guarantee that we are actually in CI. Only an env var would, and it should be easy to implement.
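One possible shape for the env-var gate (the `CI` variable name and the `assert_duration_in_ci` helper are assumptions, not existing code): enforce the timing assertion only when the variable is set, so local runs on arbitrary hardware can't flake on it.

```rust
use std::env;
use std::time::Duration;

// Hypothetical gate: only enforce the timing assertion when the `CI`
// environment variable is set; locally it degrades to a no-op.
fn assert_duration_in_ci(actual: Duration, limit: Duration) {
    if env::var("CI").is_err() {
        eprintln!("not in CI, skipping timing check ({actual:?} vs {limit:?})");
        return;
    }
    assert!(actual <= limit, "expected {actual:?} to be <= {limit:?}");
}

fn main() {
    // Passes whether or not CI is set; in CI the limit is actually enforced.
    assert_duration_in_ci(Duration::from_millis(500), Duration::from_millis(6000));
}
```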
I'd also add sc-network `discovery::tests::discovery_working`, because it looks as if it sometimes just hangs: https://github.com/paritytech/polkadot-sdk/actions/runs/15180086844/job/42701594693
Also @kianenigma mentioned these ones:
- `rejects_missing_seals`
- `sync_to_tip_when_we_sync_together_with_multiple_peers`
- `works_with_different_block_times`
And another one (fixed): sc-service-test `client::returns_status_for_pruned_blocks` https://github.com/paritytech/polkadot-sdk/actions/runs/15187569760/job/42711817408
We need to create sub-issues and assign people to them, or they will never get fixed.
`ensure_operation_limits_works`: https://github.com/paritytech/polkadot-sdk/actions/runs/15294554712/job/43020843814#step:4:4001
A new one potentially found in https://github.com/paritytech/polkadot-sdk/actions/runs/19265662309/job/55081001166#step:4:5020: