polkadot-sdk icon indicating copy to clipboard operation
polkadot-sdk copied to clipboard

☂ Fix flaky tests

Open ggwpez opened this issue 3 years ago • 19 comments

Some tests just randomly fail in the CI.

Known bad with last confirmation date:

  • [ ] polkadot-node-core-pvf::it execute_queue_doesnt_stall_with_varying_executor_params 2025.11.11
  • [ ] follow_report_multiple_pruned_block https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2540531
  • [ ] follow_forks_pruned_block https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2529507
  • [x] returns_status_for_pruned_blocks https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2522151
  • [ ] can_sync_small_non_best_forks https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2635997
  • [ ] telemetry_works on commit 890451221db37176e13cb1a306246f02de80590a assertion failed: self.wait().unwrap().success() (see comment)
  • [x] beefy_reports_equivocations CI 2023-06-14 - fix https://github.com/paritytech/substrate/pull/14382
  • [ ] response_headers_invalid_call CI 2023-05-29
  • [x] ensure_parallel_execution: Expected duration 6697ms to be less than 6000ms (2023-06-13) - fix paritytech/polkadot#7390
  • [x] execute_queue_doesnt_stall_with_varying_executor_params: Expected duration 12912ms to be less than or equal to 12000ms (2023-06-13) - fix paritytech/polkadot#7390
  • [ ] sc-offchain api::http::tests::response_header_invalid_call CI log (2023-06-16)

Maybe flaky, maybe fixed :ghost::

  • [ ] running_the_node_works_and_can_be_interrupted error-log.txt 8a9f48bcf0c9f92949082535d77c12166522bb2f
  • [ ] notifications_back_pressure https://github.com/paritytech/polkadot-sdk/issues/537

Fixed:

  • [x] temp_base_path_works https://github.com/paritytech/substrate/pull/13505 error-log.txt
  • [x] subscribe_and_unsubscribe_to_justifications
  • [x] syncs_header_only_forks https://github.com/paritytech/substrate/issues/12607
  • [x] babe authoring_blocks https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2101500 https://github.com/paritytech/substrate/pull/13199

ggwpez avatar Apr 29 '22 10:04 ggwpez

Ok, I see.

This might be quite tricky find "free" ports to use for libp2p. A first step would be to ensure that the CLI tests assigns unique ports for libp2p.

niklasad1 avatar Jun 02 '22 13:06 niklasad1

The babe test authoring_blocks failed again.
tests::authoring_blocks' panicked at 'importing block failed: ClientImport("Slot number must increase: parent slot: 1669811014, this slot: 1669811014")

ggwpez avatar Jan 20 '23 00:01 ggwpez

telemetry_works seems to be another flaky test.

When running:

while cargo test --release -p node-cli --test telemetry; do true; done

The test will fail at some point. When adding some more "debugging" the following error is shown:

[test-utils/cli/src/lib.rs:236] self.wait().unwrap() = ExitStatus(
    unix_wait_status(
        139,
    ),
)

This indicates at the spawned Substrate process is dying because of some segmentation fault. I assume the underlying problem is not related to telemetry, as it happens on shutting down the node. (Maybe still related to telemetry and only happens because the worker is doing something that it shouldn't be doing)

bkchr avatar May 02 '23 14:05 bkchr

telemetry_works seems to be another flaky test.

Confirmed and added. Should we comment it until fixed?

ggwpez avatar May 02 '23 14:05 ggwpez

Confirmed and added. Should we comment it until fixed?

I don't have seen it in CI so far, only locally on my machine. Also in debug it didn't seemed to be reproducible, so maybe on slower machines or whatever it isn't a problem. I would like to keep it there until we have seen reports of it failing in CI.

bkchr avatar May 02 '23 15:05 bkchr

Ping: Two more tests added; beefy_reports_equivocations and response_headers_invalid_call.

ggwpez avatar Jun 05 '23 14:06 ggwpez

Not sure who to ping for ensure_parallel_execution and execute_queue_doesnt_stall_with_varying_executor_params.
Git shows that @mrcnski and @s0me0ne-unkn0wn have worked with the code, maybe one of you?

ggwpez avatar Jun 13 '23 18:06 ggwpez

@ggwpez both tests are driven by calculated timeouts which is flaky by nature, we just didn't expect CI runners to have such a significant divergence in performance :/

I'll look into it.

s0me0ne-unkn0wn avatar Jun 13 '23 20:06 s0me0ne-unkn0wn

This case it is not about CI runners, it failed on Gav's PC.
In general I dont know if we can assert timeouts without running it on fixed hardware. Maybe just remove those checks? Or only run that last timing check when a CI Env variable is present.

ggwpez avatar Jun 13 '23 20:06 ggwpez

We could get rid of them, of course, but we would still want to check somehow that queues are behaving as expected, that is, that they are running jobs in parallel, not sequentially, and that they can kill workers and spawn new ones depending on conditions, and that's hardly achievable if not relying on timeouts. To only run them in CI sounds like a legit idea. Maybe limiting them to the testnet profile is enough?

s0me0ne-unkn0wn avatar Jun 13 '23 20:06 s0me0ne-unkn0wn

Maybe limiting them to the testnet profile is enough?

We don't compile test with this profile. We could add some special env variable. However, while looking at the test, could we not just spawn both invocations and check if the test process has started two child processes (the workers)?

bkchr avatar Jun 14 '23 08:06 bkchr

We don't compile test with this profile

I believe we do (at least for Polkadot): https://github.com/paritytech/polkadot/blob/master/scripts/ci/gitlab/pipeline/test.yml#L44

s0me0ne-unkn0wn avatar Jun 14 '23 09:06 s0me0ne-unkn0wn

Even that does not guarantee that we are actually in CI. Only an env var would, and should be easy to implement.

ggwpez avatar Jun 14 '23 09:06 ggwpez

I'd also add sc-network discovery::tests::discovery_working because It looks as if it sometimes just hangs: https://github.com/paritytech/polkadot-sdk/actions/runs/15180086844/job/42701594693

alvicsam avatar May 22 '25 12:05 alvicsam

Also @kianenigma mentioned these ones:

rejects_missing_seals
sync_to_tip_when_we_sync_together_with_multiple_peers
works_with_different_block_times  

alvicsam avatar May 22 '25 12:05 alvicsam

And another one: sc-service-test client::returns_status_for_pruned_blocks https://github.com/paritytech/polkadot-sdk/actions/runs/15187569760/job/42711817408 fixed

alvicsam avatar May 22 '25 14:05 alvicsam

We need to create sub issues and assign people to them. Or they will never get fixed.

athei avatar May 28 '25 10:05 athei

ensure_operation_limits_works: https://github.com/paritytech/polkadot-sdk/actions/runs/15294554712/job/43020843814#step:4:4001

alindima avatar May 28 '25 12:05 alindima

A new one potentially found in https://github.com/paritytech/polkadot-sdk/actions/runs/19265662309/job/55081001166#step:4:5020:

kianenigma avatar Nov 11 '25 13:11 kianenigma