incubator-uniffle

[FEATURE][SPARK] Support cancel async thread of handle blockEvent and rpc when writer is killed

Open summaryzb opened this issue 2 years ago • 1 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the feature

When a task is killed because of stage cancellation, because another task attempt succeeded, or for some other reason, the AddBlockEvent handling and sendShuffleData keep running. Although needCancelRequest may cancel some of the work, the AddBlockEvent entries in the thread pool's blocking queue still hold the shuffleBlockData, and so do the rpc requests that have already been sent but are still waiting for a response.

That causes 3 problems:

  1. We freeAll memory once the task is killed, but the shuffleBlockData held by the async threads still occupies memory
  2. Many useless runnables related to the killed task are still running or waiting to be executed
  3. Currently checkBlockSendResult cannot be interrupted; when the task killed because of speculation is the last one of the shuffle map stage, it blocks scheduling of the next reduce stage

Motivation

No response

Describe the solution

  1. Cancel all the runnables that are waiting to be executed or are blocked waiting for an rpc callback
  2. Interrupt checkBlockSendResult immediately
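
A rough sketch of how the two steps could look, using only `java.util.concurrent`. It is not the actual Uniffle code: `TaskSendTracker`, `submit`, `cancelAll` and `awaitAcks` are hypothetical names invented for illustration, and the real change would need to hook into the existing event handling and rpc client. The idea is to remember the `Future` of every runnable submitted for a task attempt so they can all be cancelled when the task is killed, and to wait for the send result with a call that reacts to thread interruption.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Uniffle: tracks async send work per task attempt. */
class TaskSendTracker {
    // taskId -> futures of all runnables submitted on behalf of that task attempt
    private final Map<String, List<Future<?>>> futuresByTask = new ConcurrentHashMap<>();

    /** Submits the work and remembers its Future so it can be cancelled later. */
    Future<?> submit(ExecutorService pool, String taskId, Runnable work) {
        Future<?> future = pool.submit(work);
        futuresByTask.computeIfAbsent(taskId, k -> new CopyOnWriteArrayList<>()).add(future);
        return future;
    }

    /** Cancels everything submitted for the task; returns how many futures were cancelled. */
    int cancelAll(String taskId) {
        List<Future<?>> futures = futuresByTask.remove(taskId);
        if (futures == null) {
            return 0;
        }
        int cancelled = 0;
        for (Future<?> future : futures) {
            // cancel(true): queued work never starts, running work gets its thread interrupted
            if (future.cancel(true)) {
                cancelled++;
            }
        }
        return cancelled;
    }

    /**
     * Interruptible wait for outstanding acknowledgements (assumed here to be
     * modelled as a CountDownLatch). Returns true if all acks arrived before the
     * timeout, false on timeout, and throws InterruptedException as soon as the
     * waiting thread is interrupted, e.g. when the task is killed.
     */
    static boolean awaitAcks(CountDownLatch pendingAcks, long timeoutMs)
            throws InterruptedException {
        return pendingAcks.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

With something like this, the writer's kill path could call `cancelAll(taskId)` right after freeing memory. A cancelled runnable that is blocked on an rpc only stops early if the blocking call itself responds to interruption, so that part depends on the rpc client in use.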

Additional context

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

summaryzb, Oct 26 '23 02:10

Nice catch.

zuston, Oct 26 '23 06:10