[SPARK-46920][YARN] Improve executor exit error message on YARN
What changes were proposed in this pull request?
Improve the executor exit error message on YARN by appending an explanation of the Spark-defined exit code to the diagnostics.
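Roughly, the idea is to append Spark's own explanation of the exit status to the diagnostics reported by YARN. The following is a hedged Scala sketch of where such a message could be assembled, not the actual patch; the package, object, and parameter names are assumptions, while `ExecutorExitCode.explainExitCode` is the existing Spark helper referenced later in this thread.

```scala
package org.apache.spark.deploy.yarn

import org.apache.spark.executor.ExecutorExitCode

// Rough sketch, not the actual patch: shows where a "Possible causes" field
// could be appended to the container-exit diagnostics. ExecutorExitCode is
// private[spark], so this assumes code living inside Spark's own tree;
// `exitStatus` and `yarnDiagnostics` are assumed inputs.
object ExitMessageSketch {
  def buildExitMessage(exitStatus: Int, yarnDiagnostics: String): String = {
    val sparkExitCodeReason = ExecutorExitCode.explainExitCode(exitStatus)
    s"Container exited with exit status $exitStatus. " +
      s"Possible causes: $sparkExitCodeReason " +
      s"Diagnostics reported by YARN: $yarnDiagnostics"
  }
}
```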
Why are the changes needed?
Spark defines its own exit codes, some of which overlap with the exit codes defined by YARN, so the diagnostics reported by YARN can be misleading. For example, exit code 56 means HEARTBEAT_FAILURE in Spark but INVALID_DOCKER_IMAGE_NAME in Hadoop, so the error message shown in the UI points at the wrong cause.
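To make the overlap concrete, here is a minimal, self-contained Scala sketch of how the same exit status reads under each system's convention; it is illustrative only (not Spark or Hadoop source), and only the two meanings of code 56 come from the description above.

```scala
// Illustrative only: the same numeric exit status maps to two different causes.
object ExitCodeOverlapDemo {
  // Spark's interpretation of executor exit code 56 (from the description above).
  def sparkMeaning(code: Int): String = code match {
    case 56 => "HEARTBEAT_FAILURE: executor could not send heartbeats to the driver"
    case _  => s"Unknown Spark exit code $code"
  }

  // YARN/Hadoop's interpretation of the same numeric code.
  def yarnMeaning(code: Int): String = code match {
    case 56 => "INVALID_DOCKER_IMAGE_NAME"
    case _  => s"Unknown YARN exit code $code"
  }

  def main(args: Array[String]): Unit = {
    println(s"Spark: ${sparkMeaning(56)}")
    println(s"YARN:  ${yarnMeaning(56)}")
  }
}
```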
Does this PR introduce any user-facing change?
Yes, the UI displays more information when an executor running on YARN exits with a non-zero code.
How was this patch tested?
Because a real HEARTBEAT_FAILURE depends on the network and the driver's load, I simplified the test by running `select java_method('java.lang.System', 'exit', 56)` to simulate the case above, so please ignore the differing diagnostics reported by YARN here.
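For reference, a sketch of how that simulation could be driven from spark-shell on a YARN cluster; the session setup is assumed, and only the `java_method` call comes from the test described above.

```scala
// Assumed reproduction from spark-shell started with --master yarn.
// The query forces System.exit(56) (Spark's HEARTBEAT_FAILURE code) in the
// process evaluating the expression, so YARN reports a non-zero container exit.
spark.sql("SELECT java_method('java.lang.System', 'exit', 56)").collect()
```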
Was this patch authored or co-authored using generative AI tooling?
No.
cc @yaooqinn @srowen please take a look when you have time.
+CC @tgravescs
kindly ping @tgravescs and @yaooqinn, would you please take a look?
kindly ping @tgravescs
It's a little unclear to me exactly what changes for the user. The screenshots in the description don't match the code changes made. From what I can tell you are adding another field:
s"Possible causes: $sparkExitCodeReason "
which is the ExecutorExitCode.explainExitCode(exitStatus). This seems fine to me. But the screenshots above look quite different from the shell error output; are those screenshots a correct before and after for this change? Can you please explain if I'm missing something, or how this changes a 56 error code from "invalid docker image" to "unable to send heartbeats"?
@tgravescs Thanks for checking, I have replaced the image with the verified result.
The inconsistent diagnostics message is caused by yarn.nodemanager.container-executor.class: the previous test used org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor while our production Hadoop cluster uses org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor, so the diagnostics messages differ.
ok, changes seem fine to me
Merged into master for Spark 4.0. Thanks @pan3793 @srowen @tgravescs @mridulm