bee icon indicating copy to clipboard operation
bee copied to clipboard

fix: prevent deadlock in BMT hasher causing test timeout

Open gacevicljubisa opened this issue 1 month ago • 1 comments

Checklist

  • [ ] I have read the coding guide.
  • [ ] My change requires a documentation update, and I have done it.
  • [ ] I have added tests to cover my changes.
  • [ ] I have filled out the description and linked the related issues.

Description

Fix deadlock issue in BMT (Binary Merkle Tree) hasher that was causing TestJoinerRedundancyMultilevel to timeout after 10 minutes.

The issue was caused by:

  1. Unbuffered result channel blocking when multiple goroutines try to send
  2. Race condition where multiple goroutines could reach root and attempt to send result simultaneously
  3. With large files (16384 chunks), many concurrent goroutines increased the probability of deadlock

Solution:

  • Buffer result channel (size 1) to allow at least one send to complete
  • Use sync.Once to ensure only one result is sent even if multiple goroutines reach the root
  • Use non-blocking select when sending to result channel
  • Properly reset sync.Once in Reset() method for hasher reuse

Fixes test timeout in pkg/file/joiner/joiner_test.go:TestJoinerRedundancyMultilevel

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

gacevicljubisa avatar Nov 04 '25 13:11 gacevicljubisa

@gacevicljubisa could you explain the deadlock to me? I read the PR description but I can't see how does the BMT hasher fit the problem description:

  1. If the BMT hasher hashes just one chunk, how does have a large file with multiple chunks affect the hasher? The hasher is called per chunk. The hashing of a joined file is handled in the hashing pipeline iirc
  2. in bmt.go, as far as I can see, only one goroutine is allowed to write into the result channel - L102 on master (since this is the only call to processSection with final==true). The only way that you may get a deadlock here is if Hash is called twice on the same data at the same time.
  3. In case this fix is indeed still necessary, I think that the sync.Once can be avoided here. If you make the channel have a buffer of 1, it means that the first write will always succeed. In which case, later attempts to write into the channel could just have a default case in a select (if the first write was successful - then there's something in the channel and we can return safely since one write has happened. if it was already read - we can write again, regardless if it would be read again). Also, since the BMT hashers are pooled, you must drain that channel before putting it back to the pool, otherwise if there was a value in the channel it may result in a corrupted bmt hash for a later chunk reusing that hasher.

acud avatar Nov 20 '25 16:11 acud