Improve reliability of acceptance tests
Acceptance criteria would be that instead of acceptance tests being K-shot 100%, they all become consistently 1-shot 100%.
FINDINGS:
- op-acceptance-tests/tests/interop/reorgs pkg passes with -count=5, so not actionable at the moment. (test took 2005.531s)
TODO:
- [ ] fix flaky pkg-level tests at op-acceptance-tests/tests/interop/sync/multisupervisor_interop
  - [ ] fix flaky TestL2CLAheadOfSupervisor
  - [x] fix flaky TestUnsafeChainUnknownToL2CL -- https://github.com/ethereum-optimism/optimism/pull/16394
- [x] fix flaky pkg-level tests at op-acceptance-tests/tests/interop/seqwindow -- TestSequencingWindowExpiry -- fixed at https://github.com/ethereum-optimism/optimism/pull/16393
- [x] fix flaky pkg-level tests at op-acceptance-tests/tests/interop/reorgs -- https://github.com/ethereum-optimism/optimism/pull/16415
09/06/2025, 05:28
TOP10 flakiest acceptance tests (by #flakes):
TestSequencingWindowExpiry (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/seqwindow) [104 flakes]
TestL2CLAheadOfSupervisor (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/sync/multisupervisor_interop) [97 flakes]
TestUnsafeChainUnknownToL2CL (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/sync/multisupervisor_interop) [29 flakes]
TestReorgInvalidExecMsgs/invalid_chain_id (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/reorgs) [21 flakes]
TestReorgInvalidExecMsgs/invalid_block_number (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/reorgs) [19 flakes]
TestUnsafeChainUnknownToL2CL (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/sync/redundant_interop) [19 flakes]
TestL2CLSyncP2P (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/sync/multisupervisor_interop) [15 flakes]
TestReorgUnsafeHead (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/reorgs) [14 flakes]
TestReorgInvalidExecMsgs/invalid_log_index (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/reorgs) [10 flakes]
TestLoad (github.com/ethereum-optimism/optimism/op-acceptance-tests/tests/interop/loadtest) [9 flakes]
Note that even if we measure the flakiness of single tests using
go test -v -count=1337 -run ^TestName$
the result may not match CI, because some tests share the same environment, initialized by
func TestMain(m *testing.M) {
...
So, for example, the sync tests located at multisupervisor_interop may interfere with each other, increasing flakiness.
Good point @pcw109550, we should measure flakiness by package rather than by test.
TestL2CLAheadOfSupervisor passes with -count=5
The reorg package passes with -count=5.
Will continue to review the tests, but hopefully we also catch some useful logs from CircleCI.
In any case, I think running the same test with -count=5 is also a useful indicator, as it reuses the environment across runs, although I agree that per-package runs increase interference.
Package op-acceptance-tests/tests/interop/sync/multisupervisor_interop seems to be flaky when all tests are run within the package, so I will be looking into it (TestUnsafeChainUnknownToL2CL and TestL2CLAheadOfSupervisor)
Also see the new "Flakiness Report", FYI https://github.com/ethereum-optimism/optimism/pull/16411
Note that we also include the "Job Name" here, which tells us which backend was used and may give us more clues as to where the flakiness arises.
Closing this, as we improved the tests considerably. We should continuously monitor CI, keep improving it along with the tests, and make sure we don't introduce too many flaky tests.