spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-47764][FOLLOW-UP][WIP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly

Open bozhang2820 opened this issue 1 year ago • 2 comments

What changes were proposed in this pull request?

This is a follow-up for https://github.com/apache/spark/pull/45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies.

There was a bug where ShuffleManager.unregisterShuffle was used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to use ShuffleDriverComponents.removeShuffle instead.

Why are the changes needed?

This is to address the comments in https://github.com/apache/spark/pull/45930#discussion_r1584223064

Does this PR introduce any user-facing change?

No

How was this patch tested?

TBD

Was this patch authored or co-authored using generative AI tooling?

No

bozhang2820 avatar Apr 30 '24 12:04 bozhang2820

Could you re-trigger CI, @bozhang2820 ?

dongjoon-hyun avatar May 01 '24 05:05 dongjoon-hyun

To prevent accidental merging, I converted this to Draft PR because the CI is broken and the PR title has [WIP].

dongjoon-hyun avatar May 07 '24 16:05 dongjoon-hyun

Could you re-trigger CI, @bozhang2820 ?

Sorry for the late reply @dongjoon-hyun.. Updated the test and removed the WIP tag.

It seems that the current test failure is unrelated to this change?

bozhang2820 avatar May 27 '24 05:05 bozhang2820

@bozhang2820 this PR is needed for the feature to work in my tests, I am just wondering what the status of this is.

abellina avatar Jul 12 '24 21:07 abellina

Looks reasonable - can you add a test for this ? We can use local vs local-cluster to test the expected behavior

I tried adding a unit test for local-cluster mode but found it a bit difficult:

  1. The cleanup is done in a best-effort basis, and in local-cluster mode the shuffle blocks are less likely to be cleaned up in time.
  2. There seems to be no API to get the MigratableResolver on executors from the driver side.

What do you think?

bozhang2820 avatar Jul 23 '24 14:07 bozhang2820

Thank you, @bozhang2820, @cloud-fan , @mridulm . Merged to master for Apache Spark 4.0.0-preview2.

dongjoon-hyun avatar Jul 24 '24 23:07 dongjoon-hyun

For the test cases, please file a JIRA issue not to forget because it seems to need more time, @bozhang2820 .

dongjoon-hyun avatar Jul 24 '24 23:07 dongjoon-hyun