boxo icon indicating copy to clipboard operation
boxo copied to clipboard

fix: flaky TestSessionBetweenPeers with shuffle enabled

Open lidel opened this issue 4 months ago • 2 comments

TLDR: fixes a racy test. This is likely just a problem with test's inability to fully test in isolation, rather a bug in the code.

cc @gammazero for sanity check

Why

Unsure if I fully understand the test/bitswap, so take this with the grain of salt, but iiuc the test was flaky (example) when run with -shuffle=on because of a race condition in message counting. After fetching the first block (cids[0]), which triggers a broadcast want-have to all peers and then a CANCEL when received, the test immediately continued fetching more blocks.

During this time, some other internal timers (idle tick or periodic search?) could fire and cause a rebroadcast of wants before the test checked message counts on uninvolved nodes. With test shuffling, goroutine scheduling changes made this timing issue more likely, causing uninvolved nodes to sometimes receive 3 messages instead of the expected 2.

Adding a small delay after the first block fetch ensures the CANCEL is fully processed and the session stabilizes before continuing, preventing the race condition.

(Not scientific, but run it hundreds times and was not able to reproduce flaky race anymore, so at least this PR makes our CI more stable across PRs)

lidel avatar Sep 01 '25 14:09 lidel

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 60.70%. Comparing base (4c0aa3a) to head (27d0588).

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1022      +/-   ##
==========================================
- Coverage   60.76%   60.70%   -0.07%     
==========================================
  Files         268      268              
  Lines       33593    33593              
==========================================
- Hits        20414    20391      -23     
- Misses      11508    11524      +16     
- Partials     1671     1678       +7     

see 7 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Sep 01 '25 14:09 codecov[bot]

I think there may be some other issues causing test unreliability.. It looks like CI is sometimes hanging forever (until timeout) or getting wrong count with this PR. Need further investigation.

gammazero avatar Sep 03 '25 01:09 gammazero