fix: flaky TestSessionBetweenPeers with shuffle enabled
TLDR: fixes a racy test. This is likely just a problem with test's inability to fully test in isolation, rather a bug in the code.
cc @gammazero for sanity check
Why
Unsure if I fully understand the test/bitswap, so take this with the grain of salt, but iiuc the test was flaky (example) when run with -shuffle=on because of a race condition in message counting. After fetching the first block (cids[0]), which triggers a broadcast want-have to all peers and then a CANCEL when received, the test immediately continued fetching more blocks.
During this time, some other internal timers (idle tick or periodic search?) could fire and cause a rebroadcast of wants before the test checked message counts on uninvolved nodes. With test shuffling, goroutine scheduling changes made this timing issue more likely, causing uninvolved nodes to sometimes receive 3 messages instead of the expected 2.
Adding a small delay after the first block fetch ensures the CANCEL is fully processed and the session stabilizes before continuing, preventing the race condition.
(Not scientific, but run it hundreds times and was not able to reproduce flaky race anymore, so at least this PR makes our CI more stable across PRs)
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 60.70%. Comparing base (4c0aa3a) to head (27d0588).
@@ Coverage Diff @@
## main #1022 +/- ##
==========================================
- Coverage 60.76% 60.70% -0.07%
==========================================
Files 268 268
Lines 33593 33593
==========================================
- Hits 20414 20391 -23
- Misses 11508 11524 +16
- Partials 1671 1678 +7
see 7 files with indirect coverage changes
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
I think there may be some other issues causing test unreliability.. It looks like CI is sometimes hanging forever (until timeout) or getting wrong count with this PR. Need further investigation.