[#1628] Avoid exception caused by calling release multiple times
What changes were proposed in this pull request?
Set internal buffer to null after release.
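A minimal sketch of the idea (hypothetical class and field names, not the actual Uniffle code): null out the internal buffer on the first release so a second call becomes a harmless no-op instead of throwing.

```java
// Hypothetical stand-in for the real result class; the actual Uniffle
// class wraps a managed buffer rather than a plain byte array.
public class ShuffleIndexResult {
    private byte[] buffer;

    public ShuffleIndexResult(byte[] buffer) {
        this.buffer = buffer;
    }

    public void release() {
        if (buffer == null) {
            return; // already released: ignore instead of throwing
        }
        // ... free the underlying resource here ...
        buffer = null; // the proposed fix: reset after release
    }

    public boolean isReleased() {
        return buffer == null;
    }
}
```

With this guard, calling `release()` twice is safe, which is exactly the symptom the PR works around.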
Why are the changes needed?
address https://github.com/apache/incubator-uniffle/issues/1628#issuecomment-2182902051
Fix: #1628
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests.
Why would release be called multiple times?
Test Results
2 641 files ±0  2 641 suites ±0  5h 27m 46s :stopwatch: -24s
946 tests +2  945 :white_check_mark: +3  1 :zzz: ±0  0 :x: -1
11 803 runs +30  11 788 :white_check_mark: +31  15 :zzz: ±0  0 :x: -1
Results for commit d1be2e4c. ± Comparison against base commit 1482804f.
> Why would `release` be called multiple times?
I didn't find out why it was called multiple times in #1628. Apparently, in https://github.com/apache/incubator-uniffle/blob/1482804f3bd365adce9af251a0a71ca463928083/storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java#L72, it checks whether shuffleIndexResult is not empty. One of my guesses is that the buffer is cleaned up when the netty client is closed.
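For context on why a second `release()` throws at all, here is a plain-Java illustration (no Netty dependency; class name is hypothetical) of reference-counted buffer semantics similar to what Netty uses: once the count hits zero, a further release is an error.

```java
// Simplified model of a reference-counted buffer: releasing past a
// count of zero throws, which is the exception the PR title refers to.
public class RefCountedBuffer {
    private int refCnt = 1;

    public void release() {
        if (refCnt == 0) {
            throw new IllegalStateException("refCnt: 0, double release");
        }
        refCnt--;
    }

    public int refCnt() {
        return refCnt;
    }
}
```

So if the netty client closing already released the buffer, any later release from the read path would trip this check.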
Hmm, I'm not sure about this change.
If we can find the root cause of why it's called multiple times, and can confirm that releasing multiple times is legitimate, this change looks good to me.
However, we have not found the root cause. By simply resetting the managedBuffer to null, we may hide a bigger problem and cause more subtle problems later.
Could we add more logging first to help debug this issue? For example, we could log the double release in the release method along with the stack trace.
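The suggestion above could look like the following sketch (hypothetical class and method names, and using `printStackTrace` in place of the project's logger): record the call site of the first release, and when a double release happens, print both stack traces so the offending code path shows up in the logs.

```java
// Diagnostic variant: instead of silently ignoring a double release,
// report where both the first and second release came from.
public class TrackedBuffer {
    private byte[] buffer;
    private Throwable firstReleaseSite;

    public TrackedBuffer(byte[] buffer) {
        this.buffer = buffer;
    }

    public void release() {
        if (buffer == null) {
            // Second release: print this call site, chained to the first one.
            new Throwable("double release detected; first release was at:",
                firstReleaseSite).printStackTrace();
            return;
        }
        // First release: remember where it happened for later reporting.
        firstReleaseSite = new Throwable("first release call site");
        buffer = null;
    }

    public boolean isReleased() {
        return buffer == null;
    }
}
```

This keeps the PR's no-op behavior while still surfacing enough information to find the root cause.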
We need to find out the root cause, which is better. Maybe it has something to do with Uniffle's Hadoop-related code, because this has only happened in RepartitionWithHadoopHybridStorageRssTest.
Thanks, that makes sense to me. I will try uploading the GA failure logs for uniffle first.
Do we have any progress here? :)