[SPARK-22876][YARN] Respect YARN AM failure validity interval
What changes were proposed in this pull request?
Adds support for using the YARN application master failure validity interval when determining whether an attempt is the last attempt during shutdown. In the application master shutdown hook, if the validity interval is specified, we fetch previous attempt information from the resource manager and make a best-effort determination of whether YARN is going to retry the application. This is necessary because, AFAIK, there is no way for Spark to know with certainty whether YARN will retry the application.
This can lead to two scenarios where Spark is wrong about whether YARN will retry it or not:
- Spark thinks it is the last attempt but YARN will retry the application. This is what can already happen by not respecting the failure validity at all.
- Spark thinks the application will retry but YARN thinks it was the last attempt. This will result in the staging directory not being cleaned up, but that can already happen if Spark gets killed during the last attempt and doesn't run its shutdown hooks.
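The best-effort check described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Spark implementation: `AttemptReport` is a minimal stand-in for the attempt reports the resource manager returns, and the method names are illustrative only.

```java
import java.util.List;

public class LastAttemptCheck {
    // Minimal stand-in for the per-attempt report fetched from the RM.
    record AttemptReport(int attemptId, long finishTimeMs, boolean failed) {}

    // maxAttempts mirrors spark.yarn.maxAppAttempts; validityIntervalMs mirrors
    // spark.yarn.am.attemptFailuresValidityInterval. Both names are real Spark
    // configs, but this decision logic is only a sketch.
    static boolean isLastAttempt(List<AttemptReport> previousAttempts,
                                 int maxAttempts, long validityIntervalMs, long nowMs) {
        // Only failures whose finish time falls inside the validity window
        // count toward the limit; older failures have "expired".
        long recentFailures = previousAttempts.stream()
            .filter(a -> a.failed() && (nowMs - a.finishTimeMs()) <= validityIntervalMs)
            .count();
        // If the current (failing) attempt plus the recent past failures
        // reaches the limit, we predict YARN will not retry.
        return recentFailures + 1 >= maxAttempts;
    }

    public static void main(String[] args) {
        List<AttemptReport> prev = List.of(new AttemptReport(1, 1_000L, true));
        // Failure still inside the window: this attempt is the last one.
        System.out.println(isLastAttempt(prev, 2, 10_000L, 2_000L));  // true
        // Failure has expired out of the window: YARN should retry.
        System.out.println(isLastAttempt(prev, 2, 10_000L, 20_000L)); // false
    }
}
```

Because the RM's view can change between this check and the actual retry decision, the two mismatch scenarios above are inherent to any client-side prediction like this.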
Why are the changes needed?
Spark supports passing the AM failure validity interval to YARN via the spark.yarn.am.attemptFailuresValidityInterval setting, but then ignores it during shutdown when determining whether this is the last attempt and whether to clean up the staging directory. Currently, if you try to use this setting, YARN will respect it and attempt to retry the application, but only if the last attempt died from a hard shutdown (like an OOM kill) that prevented the shutdown logic from running. During a graceful shutdown, Spark deletes the staging directory and unregisters the application from YARN, preventing any further attempts.
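For context, a typical job submission using these settings might look like the following. This is an illustrative example (the application name is a placeholder); both conf keys are real Spark-on-YARN settings, and with them YARN only counts AM failures from the last hour toward the three-attempt limit:

```shell
# Hypothetical submission for a long-running streaming job:
# old AM failures "expire" out of the 1h window instead of
# permanently counting against the 3-attempt limit.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  my_streaming_job.jar
```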
Does this PR introduce any user-facing change?
Yes. Spark on YARN better supports graceful recovery and retry of jobs, which helps enable long-running Spark Streaming jobs.
How was this patch tested?
New UT. It unfortunately relies on sleeping to test the validity timing, which seems very prone to flakiness, but I couldn't think of any other way to test it. If the approach seems sound, it might be worth just removing the test. I don't see any existing tests even for the simple max-YARN-attempts behavior.
Was this patch authored or co-authored using generative AI tooling?
No
+CC @tgravescs, @attilapiros
So to clarify the issue here: the original PR that added this validityInterval config (https://github.com/apache/spark/pull/8857/files) just seems to call YARN's setAttemptFailuresValidityInterval(). So you are saying that the config works on the YARN side, but on the Spark side we think it's the last attempt when it really isn't. That makes sense.
I'm not sure I understand how your second point above (Spark thinks the application will retry but YARN thinks it was the last attempt) happens with this config, though? What is the scenario there?
> I'm not sure I understand how your second point above (Spark thinks the application will retry but YARN thinks it was the last attempt) happens with this config, though? What is the scenario there?
Theoretically that should only happen if there are inconsistencies/bugs in the Spark-side implementation compared to the YARN side. I just included it as a general "risk" of adding this.
This has been working great in production for us for nearly a year now
Gentle ping, has been working great for us for over a year now
Hi @Kimahriman
We've encountered a problem where applications are still retrying, despite their failed attempts having reached the limit of spark.yarn.maxAppAttempts within the spark.yarn.am.attemptFailuresValidityInterval.
spark.yarn.am.attemptFailuresValidityInterval=1 hour
spark.yarn.am.maxAttempts=2
yarn.resourcemanager.am.max-attempts=10
INFO 2025-06-15T14:20:55.547Z Registering app attempt : appattempt_*******_****_000001
INFO 2025-06-15T16:06:55.361Z Registering app attempt : appattempt_*******_****_000002
INFO 2025-06-15T16:25:52.932Z Registering app attempt : appattempt_*******_****_000003
Do you think it matches this case you mentioned, and if so, will it be fixed by this PR?
> Spark thinks it is the last attempt but YARN will retry the application. This is what can already happen by not respecting the failure validity at all.
Thank you!
Hmmm, no, that scenario doesn't make sense to me; I'm not sure why that would happen. This PR wouldn't fix that, and I can't say why the third attempt happened without more debugging.
CC @yaooqinn as one of the people who's been making non-refactoring changes to the YARN code as of late.