
[SPARK-22876][YARN] Respect YARN AM failure validity interval

Open Kimahriman opened this pull request 2 years ago • 4 comments

What changes were proposed in this pull request?

Adds support for using the YARN application master failure validity interval when determining whether an attempt is the last attempt during shutdown. In the application master shutdown hook, if the validity interval is specified, we fetch previous attempt information from the resource manager and make a best-effort guess at whether YARN is going to retry the application. This is because, as far as I know, there is no way for Spark to know with 100% certainty whether YARN will retry the application.

This can lead to two scenarios where Spark is wrong about whether YARN will retry it or not:

  • Spark thinks it is the last attempt but YARN will retry the application. This is what can already happen today by not respecting the failure validity interval at all.
  • Spark thinks the application will retry but YARN thinks it was the last attempt. This will result in the staging directory not being cleaned up, but that can already happen if Spark is killed during the last attempt and doesn't run its shutdown hooks.
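
For illustration only, here is a rough sketch of the kind of best-effort check described above; it is not the code in this PR. It assumes the AM has a YarnClient handle, that ApplicationAttemptReport#getFinishTime is available in the target Hadoop version, and it uses illustrative names (isLikelyLastAttempt, maxAttempts, validityIntervalMs):

```scala
import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationAttemptState}
import org.apache.hadoop.yarn.client.api.YarnClient
import scala.collection.JavaConverters._

object LastAttemptSketch {
  // Best-effort guess at whether YARN will retry: count failed attempts whose
  // finish time still falls inside the validity window and compare against the
  // attempt limit. YARN makes the real decision; this only tries to mirror it.
  def isLikelyLastAttempt(
      yarnClient: YarnClient,
      appId: ApplicationId,
      maxAttempts: Int,
      validityIntervalMs: Long): Boolean = {
    val now = System.currentTimeMillis()
    val reports = yarnClient.getApplicationAttempts(appId).asScala
    val recentFailures = reports.count { r =>
      // getFinishTime is assumed to be present on ApplicationAttemptReport
      // in the Hadoop version in use.
      r.getYarnApplicationAttemptState == YarnApplicationAttemptState.FAILED &&
        (now - r.getFinishTime) <= validityIntervalMs
    }
    // Count the current (about-to-fail) attempt as well.
    recentFailures + 1 >= maxAttempts
  }
}
```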

Why are the changes needed?

Spark supports passing the AM attempt failures validity interval to YARN via the spark.yarn.am.attemptFailuresValidityInterval setting, but then ignores it during shutdown when determining whether this is the last attempt and whether to clean up the staging directory. Currently, if you use this setting, YARN will respect it and retry the application, but only if the previous attempt died from a hard failure (like an OOM kill) that prevented the shutdown logic from running. During a graceful shutdown, Spark deletes the staging directory and unregisters the application from YARN, preventing any further attempts.
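
As context (the values below are illustrative, not taken from the PR), the two settings involved look roughly like this when building the configuration for a long-running job:

```scala
import org.apache.spark.SparkConf

object RetryConfSketch {
  // spark.yarn.am.attemptFailuresValidityInterval is the setting this PR teaches
  // the AM shutdown path to honor; spark.yarn.maxAppAttempts caps how many
  // attempts YARN will make. The values here are only examples.
  val conf: SparkConf = new SparkConf()
    .set("spark.yarn.maxAppAttempts", "4")
    .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
}
```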

Does this PR introduce any user-facing change?

Yes. Spark on YARN now better supports graceful recovery and retry of jobs, which helps long-running Spark Streaming jobs.

How was this patch tested?

New UT. It unfortunately relies on sleeping to test the validity timing, which seems very prone to flakiness, but I couldn't think of another way to test it. If the approach seems sound, it might be worth just removing the test. I don't see any existing tests for the plain max-YARN-attempts case either.

Was this patch authored or co-authored using generative AI tooling?

No

Kimahriman avatar Aug 19 '23 15:08 Kimahriman

+CC @tgravescs, @attilapiros

mridulm avatar Aug 20 '23 05:08 mridulm

So to clarify the issue here: the original PR that added this validityInterval config (https://github.com/apache/spark/pull/8857/files) just seems to call YARN's setAttemptFailuresValidityInterval(). So you are saying that the config works on the YARN side, but on the Spark side we think it's the last attempt when it really isn't. That makes sense.
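
For reference, the client-side plumbing from that original PR boils down to something like the following simplified sketch (not the exact Spark code), using Hadoop's ApplicationSubmissionContext API:

```scala
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

object SubmissionSketch {
  // Spark forwards the configured interval to YARN at submission time; YARN then
  // only counts AM failures that happened within that window toward the
  // max-attempts limit.
  def applyValidityInterval(
      appContext: ApplicationSubmissionContext,
      validityIntervalMs: Long): Unit = {
    appContext.setAttemptFailuresValidityInterval(validityIntervalMs)
  }
}
```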

I'm not sure I understand how your second point above (Spark thinks the application will retry but YARN thinks it was the last attempt) happens with this config, though. What is the scenario there?

tgravescs avatar Aug 21 '23 13:08 tgravescs

I'm not sure I understand how your second point above (Spark thinks the application will retry but YARN thinks it was the last attempt) happens with this config, though. What is the scenario there?

Theoretically that should only happen if there are inconsistencies/bugs in the Spark-side implementation compared to the YARN side. I just included it as a general "risk" of adding this.

Kimahriman avatar Aug 21 '23 13:08 Kimahriman

This has been working great in production for us for nearly a year now

Kimahriman avatar Jun 04 '24 12:06 Kimahriman

Gentle ping, has been working great for us for over a year now

Kimahriman avatar Mar 06 '25 19:03 Kimahriman

Hi @Kimahriman

We've encountered a problem where applications are still retrying even though their failed attempts have reached the spark.yarn.maxAppAttempts limit within the spark.yarn.am.attemptFailuresValidityInterval window.

spark.yarn.am.attemptFailuresValidityInterval=1 hour
spark.yarn.am.maxAttempts=2
yarn.resourcemanager.am.max-attempts=10
INFO 2025-06-15T14:20:55.547Z Registering app attempt : appattempt_*******_****_000001
INFO 2025-06-15T16:06:55.361Z Registering app attempt : appattempt_*******_****_000002
INFO 2025-06-15T16:25:52.932Z Registering app attempt : appattempt_*******_****_000003

Do you think it matches this case you mentioned, and if so, will it be fixed by this PR?

Spark thinks it is the last attempt but YARN will retry the application. This is what can already happen by not respecting the failure validity at all.

Thank you!

xujiongda avatar Jul 03 '25 01:07 xujiongda

Hmmm, no, that scenario doesn't make sense to me; I'm not sure why that would happen. This PR wouldn't fix that, and I can't say why the third attempt happened without more debugging.

Kimahriman avatar Jul 03 '25 01:07 Kimahriman

CC @yaooqinn as one of the people who's been making non-refactoring changes to the YARN code as of late.

holdenk avatar Nov 17 '25 18:11 holdenk