[SPARK-47764][FOLLOW-UP][WIP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly
What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies.
There was a bug where ShuffleManager.unregisterShuffle was used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to use ShuffleDriverComponents.removeShuffle instead.
Why are the changes needed?
This is to address the comments in https://github.com/apache/spark/pull/45930#discussion_r1584223064
Does this PR introduce any user-facing change?
No
How was this patch tested?
TBD
Was this patch authored or co-authored using generative AI tooling?
No
Could you re-trigger CI, @bozhang2820 ?
To prevent accidental merging, I converted this to Draft PR because the CI is broken and the PR title has [WIP].
Could you re-trigger CI, @bozhang2820 ?
Sorry for the late reply @dongjoon-hyun.. Updated the test and removed the WIP tag.
It seems that the current test failure is unrelated to this change?
@bozhang2820 this PR is needed for the feature to work in my tests, I am just wondering what the status of this is.
Looks reasonable - can you add a test for this ? We can use
localvslocal-clusterto test the expected behavior
I tried adding a unit test for local-cluster mode but found it a bit difficult:
- The cleanup is done in a best-effort basis, and in local-cluster mode the shuffle blocks are less likely to be cleaned up in time.
- There seems to be no API to get the MigratableResolver on executors from the driver side.
What do you think?
Thank you, @bozhang2820, @cloud-fan , @mridulm . Merged to master for Apache Spark 4.0.0-preview2.
For the test cases, please file a JIRA issue not to forget because it seems to need more time, @bozhang2820 .