[SPARK-46920][YARN] Improve executor exit error message on YARN
What changes were proposed in this pull request?
Improve the executor exit error message on YARN by appending an explanation of the Spark-defined exit code to the diagnostics.
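Roughly, the idea is to append Spark's own explanation of the exit status to the diagnostics reported by YARN. The following is a hedged Scala sketch of where such a message could be assembled, not the actual patch; the package, object, and parameter names are assumptions, while `ExecutorExitCode.explainExitCode` is the existing Spark helper referenced later in this thread.

```scala
package org.apache.spark.deploy.yarn

import org.apache.spark.executor.ExecutorExitCode

// Rough sketch, not the actual patch: shows where a "Possible causes" field
// could be appended to the container-exit diagnostics. ExecutorExitCode is
// private[spark], so this assumes code living inside Spark's own tree;
// `exitStatus` and `yarnDiagnostics` are assumed inputs.
object ExitMessageSketch {
  def buildExitMessage(exitStatus: Int, yarnDiagnostics: String): String = {
    val sparkExitCodeReason = ExecutorExitCode.explainExitCode(exitStatus)
    s"Container exited with exit status $exitStatus. " +
      s"Possible causes: $sparkExitCodeReason " +
      s"Diagnostics reported by YARN: $yarnDiagnostics"
  }
}
```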
Why are the changes needed?
Spark defines its own exit codes, some of which overlap with the exit codes defined by YARN, so the diagnostics reported by YARN can be misleading. For example, exit code 56 means HEARTBEAT_FAILURE in Spark but INVALID_DOCKER_IMAGE_NAME in Hadoop, so the error message shown in the UI points at the wrong cause.
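To make the overlap concrete, here is a minimal, self-contained Scala sketch of how the same exit status reads under each system's convention; it is illustrative only (not Spark or Hadoop source), and only the two meanings of code 56 come from the description above.

```scala
// Illustrative only: the same numeric exit status maps to two different causes.
object ExitCodeOverlapDemo {
  // Spark's interpretation of executor exit code 56 (from the description above).
  def sparkMeaning(code: Int): String = code match {
    case 56 => "HEARTBEAT_FAILURE: executor could not send heartbeats to the driver"
    case _  => s"Unknown Spark exit code $code"
  }

  // YARN/Hadoop's interpretation of the same numeric code.
  def yarnMeaning(code: Int): String = code match {
    case 56 => "INVALID_DOCKER_IMAGE_NAME"
    case _  => s"Unknown YARN exit code $code"
  }

  def main(args: Array[String]): Unit = {
    println(s"Spark: ${sparkMeaning(56)}")
    println(s"YARN:  ${yarnMeaning(56)}")
  }
}
```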
Does this PR introduce any user-facing change?
Yes, the UI displays more information when an executor running on YARN exits with a non-zero code.
How was this patch tested?
Because a real HEARTBEAT_FAILURE depends on the network and the driver's load, I simplified the test by running `select java_method('java.lang.System', 'exit', 56)` to simulate the case above, so please ignore the differing diagnostics reported by YARN here.
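For reference, a sketch of how that simulation could be driven from spark-shell on a YARN cluster; the session setup is assumed, and only the `java_method` call comes from the test described above.

```scala
// Assumed reproduction from spark-shell started with --master yarn.
// The query forces System.exit(56) (Spark's HEARTBEAT_FAILURE code) in the
// process evaluating the expression, so YARN reports a non-zero container exit.
spark.sql("SELECT java_method('java.lang.System', 'exit', 56)").collect()
```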
Was this patch authored or co-authored using generative AI tooling?
No.
cc @yaooqinn @srowen please take a look when you have time.
+CC @tgravescs
kindly ping @tgravescs and @yaooqinn, would you please take a look?
kindly ping @tgravescs
It's a little unclear to me exactly what changes for the user. The screenshots in the description don't match the code changes made. From what I can tell you are adding another field:
s"Possible causes: $sparkExitCodeReason "
which is the ExecutorExitCode.explainExitCode(exitStatus). This seems fine to me. But the screenshots above look quite different from the shell error output; are those screenshots a correct before and after for this change? Can you please explain if I'm missing something, or how this changes a 56 error code from "invalid docker image" to "unable to send heartbeats"?
@tgravescs Thanks for checking, I have replaced the image with the verified result.
The inconsistent diagnostics message is caused by yarn.nodemanager.container-executor.class: the previous test used org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor while our production Hadoop cluster uses org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor, so the diagnostics messages differ.
ok, changes seem fine to me
Merged into master for Spark 4.0. Thanks @pan3793 @srowen @tgravescs @mridulm