bee
bee copied to clipboard
fix: prevent deadlock in BMT hasher causing test timeout
Checklist
- [ ] I have read the coding guide.
- [ ] My change requires a documentation update, and I have done it.
- [ ] I have added tests to cover my changes.
- [ ] I have filled out the description and linked the related issues.
Description
Fix deadlock issue in BMT (Binary Merkle Tree) hasher that was causing TestJoinerRedundancyMultilevel to timeout after 10 minutes.
The issue was caused by:
- Unbuffered result channel blocking when multiple goroutines try to send
- Race condition where multiple goroutines could reach root and attempt to send result simultaneously
- With large files (16384 chunks), many concurrent goroutines increased the probability of deadlock
Solution:
- Buffer result channel (size 1) to allow at least one send to complete
- Use sync.Once to ensure only one result is sent even if multiple goroutines reach the root
- Use non-blocking select when sending to result channel
- Properly reset sync.Once in Reset() method for hasher reuse
Fixes test timeout in pkg/file/joiner/joiner_test.go:TestJoinerRedundancyMultilevel
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):
@gacevicljubisa could you explain the deadlock to me? I read the PR description but I can't see how does the BMT hasher fit the problem description:
- If the BMT hasher hashes just one chunk, how does have a large file with multiple chunks affect the hasher? The hasher is called per chunk. The hashing of a joined file is handled in the hashing pipeline iirc
- in
bmt.go, as far as I can see, only one goroutine is allowed to write into the result channel -L102onmaster(since this is the only call toprocessSectionwithfinal==true). The only way that you may get a deadlock here is ifHashis called twice on the same data at the same time. - In case this fix is indeed still necessary, I think that the
sync.Oncecan be avoided here. If you make the channel have a buffer of 1, it means that the first write will always succeed. In which case, later attempts to write into the channel could just have adefaultcase in aselect(if the first write was successful - then there's something in the channel and we can return safely since one write has happened. if it was already read - we can write again, regardless if it would be read again). Also, since the BMT hashers are pooled, you must drain that channel before putting it back to the pool, otherwise if there was a value in the channel it may result in a corrupted bmt hash for a later chunk reusing that hasher.