spark
spark copied to clipboard
[SPARK-39957][CORE] Delay onDisconnected to enable Driver receives ExecutorExitCode
What changes were proposed in this pull request?
When onDisconnected is triggered,
(1) Delay RemoveExecutor
for 5 seconds to enable driver receives ExecutorExitCode from slow path
(2) Prevent task scheduler from assigning tasks on the lost executor. (By adding the executor to executorsPendingLossReason
)
Why are the changes needed?
There are two methods to detect executor loss.
(1) (fast path) onDisconnected
Executor -> Driver:
When Executor closes its JVM, the socket (Netty's channel) will be closed. The function onDisconnected will be triggered when it knows the channel is closed.
(2) (slow path) ExecutorRunner -> Worker -> Master -> Driver (See #37385 for details) When executor exits with ExecutorExitCode, the exit code will be passed from ExecutorRunner to Driver.
Because fast path determines the executor loss without the information of ExecutorExitCode, these two methods may categorize same cases into different conclusions. For example, when Executor exits with ExecutorExitCode HEARTBEAT_FAILURE, onDisconnected will consider the executor loss as a task failure, but slow path will consider it as a network failure. Obviously, HEARTBEAT_FAILURE is a network failure.
[Notice] For more details about ExecutorExitCode, check #37385 for more details.
Does this PR introduce any user-facing change?
No
How was this patch tested?
build/sbt "core/testOnly *SparkContextSuite -- -z ExitCode"
cc. @Ngone51 @jiangxb1987
Can one of the admins verify this patch?
Gentle ping @Ngone51 @mridulm
I will let @Ngone51 review/merge - I do not have a lot of context on standalone :-)
Thanks @kevin85421 @mridulm , merged to Master!
Thank @mridulm @Ngone51 for the review!