[SPARK-49252][CORE] Make `TaskSetExcludeList` and `HealthTracker` independent
What changes were proposed in this pull request?
Allow TaskSetExcludeList and HealthTracker to be enabled independently of each other.
When the application-level HealthTracker is created but task-set-level exclusion is not enabled, TaskSetExcludeList is created in dry-run mode: it still records task failures and reports them to HealthTracker, but it does not participate in scheduling decisions.
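To illustrate the dry-run behavior described above, here is a minimal sketch. The classes `SketchHealthTracker` and `SketchTaskSetExcludeList`, the `isDryRun` flag, and the `maxFailuresPerExecutor` threshold are hypothetical, simplified stand-ins and are not the classes or signatures changed in this PR.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the application-level tracker.
final class SketchHealthTracker {
  private val failuresByExecutor =
    mutable.Map.empty[String, Int].withDefaultValue(0)

  // Accumulate the failure counts reported by a task set.
  def recordFailures(failures: Map[String, Int]): Unit =
    failures.foreach { case (exec, n) => failuresByExecutor(exec) += n }

  def failureCount(exec: String): Int = failuresByExecutor(exec)
}

// Hypothetical stand-in for the task-set-level exclude list.
final class SketchTaskSetExcludeList(isDryRun: Boolean, maxFailuresPerExecutor: Int) {
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Failures are always recorded, even in dry-run mode.
  def recordFailure(exec: String): Unit = failures(exec) += 1

  // In dry-run mode the list never excludes anything, so the scheduler ignores it.
  def isExecutorExcluded(exec: String): Boolean =
    !isDryRun && failures(exec) >= maxFailuresPerExecutor

  // Failure data is still handed to the application-level tracker.
  def reportTo(tracker: SketchHealthTracker): Unit =
    tracker.recordFailures(failures.toMap)
}
```

With `isDryRun = true`, `isExecutorExcluded` always returns `false`, while `reportTo` still forwards the recorded counts, which is the split between bookkeeping and decision making that dry-run mode relies on.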
Why are the changes needed?
Currently, setting spark.excludeOnFailure.enabled to true enables both task-set-level exclusion (TaskSetExcludeList) and application-level exclusion (HealthTracker).
In some cases, we only want to enable exclusion on a single dimension.
Does this PR introduce any user-facing change?
Yes. This PR introduces two new user-facing configs, spark.excludeOnFailure.application.enabled and spark.excludeOnFailure.taskAndStage.enabled, which allow exclusion to be enabled for the application and for the task set individually.
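As a rough sketch of how the two configs could be resolved, the snippet below reads them from a plain `Map`. The object `ExcludeOnFailureSketch` and its methods are hypothetical, and the fallback to spark.excludeOnFailure.enabled when a level-specific key is unset is an assumption that this description does not spell out.

```scala
// Hypothetical resolver over a plain Map; Spark's own SparkConf handling may differ.
object ExcludeOnFailureSketch {
  val General = "spark.excludeOnFailure.enabled"
  val Application = "spark.excludeOnFailure.application.enabled"
  val TaskAndStage = "spark.excludeOnFailure.taskAndStage.enabled"

  // Assumption: a level-specific key, when set, wins over the general key;
  // when neither is set, exclusion is treated as disabled.
  private def resolve(conf: Map[String, String], specific: String): Boolean =
    conf.get(specific).orElse(conf.get(General)).exists(_.toBoolean)

  def applicationEnabled(conf: Map[String, String]): Boolean =
    resolve(conf, Application)

  def taskAndStageEnabled(conf: Map[String, String]): Boolean =
    resolve(conf, TaskAndStage)
}
```

For example, a conf that sets only the application key to `true` and the taskAndStage key to `false` enables exclusion on a single dimension, which is the case where TaskSetExcludeList would run in dry-run mode.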
How was this patch tested?
New unit tests.
Was this patch authored or co-authored using generative AI tooling?
No
@cloud-fan @jiangxb1987 can I get a review on this PR? Thx!
cc @Ngone51
It would be good to mention the dry-run mode introduced in this PR.
Done updating description.
cc @jerryshao @mridulm @Ngone51
Thanks, merged to master!