incubator-uniffle icon indicating copy to clipboard operation
incubator-uniffle copied to clipboard

[#1628] Avoid exception caused by calling release multiple times

Open wForget opened this issue 1 year ago • 7 comments

What changes were proposed in this pull request?

Set internal buffer to null after release.

Why are the changes needed?

address https://github.com/apache/incubator-uniffle/issues/1628#issuecomment-2182902051

Fix: #1628

Does this PR introduce any user-facing change?

No.

How was this patch tested?

added unit tests

wForget avatar Jun 24 '24 07:06 wForget

Why would release be called multiple times?

rickyma avatar Jun 24 '24 07:06 rickyma

Test Results

 2 641 files  ± 0   2 641 suites  ±0   5h 27m 46s :stopwatch: -24s    946 tests + 2     945 :white_check_mark: + 3   1 :zzz: ±0  0 :x:  - 1  11 803 runs  +30  11 788 :white_check_mark: +31  15 :zzz: ±0  0 :x:  - 1 

Results for commit d1be2e4c. ± Comparison against base commit 1482804f.

github-actions[bot] avatar Jun 24 '24 08:06 github-actions[bot]

Why would release be called multiple times?

I didn't find out why it was called multiple times in #1628. Apparently, in https://github.com/apache/incubator-uniffle/blob/1482804f3bd365adce9af251a0a71ca463928083/storage/src/main/java/org/apache/uniffle/storage/handler/impl/DataSkippableReadHandler.java#L72, it checks whether shuffleIndexResult is not empty. One of my guesses is that buffer is cleaned up when netty client is closed.

wForget avatar Jun 24 '24 08:06 wForget

Hmm, I'm not sure about this change.

If we can find the root cause why it's called multiple times and can reason that the multiple released is necessary, this change looks good to me.

However, we have not find the root cause. By simply avoid/reset the managedBuffer to null, we may hide a bigger problem and causing more subtle problems later.

Could we add more logging first to help debugging this issue? For example, we can log the double release in the releasing method with the stack trace?

advancedxy avatar Jun 24 '24 11:06 advancedxy

We need to find out the root cause, which is better. Maybe it has something to do with Uniffle Hadoop related codes. Because this has only happened in RepartitionWithHadoopHybridStorageRssTest.

rickyma avatar Jun 25 '24 17:06 rickyma

Thanks, that makes sense to me, I will try uploading GA failure logs for uniffle first.

wForget avatar Jun 26 '24 01:06 wForget

Do we have any progress here? :)

rickyma avatar Jun 30 '24 07:06 rickyma