incubator-uniffle icon indicating copy to clipboard operation
incubator-uniffle copied to clipboard

[#1020] feat(spark): Support hybrid shuffle manager

Open pegasas opened this issue 2 years ago • 9 comments

What changes were proposed in this pull request?

Delegation shuffle manager has been supported in uniffle, which is useful for avoiding instability of uniffle cluster.

But this is not enough, we hope the hybrid shuffle manager could be supported, that will support using ESS or Uniffle on different stages in one App

Why are the changes needed?

  1. Add HybridShuffleManager.
  2. Add UnitTests.

Fix: https://github.com/apache/incubator-uniffle/issues/1020

Does this PR introduce any user-facing change?

(Please list the user-facing changes introduced by your change, including

  1. Change in user-facing APIs.
  2. Addition or removal of property keys.)

No.

How was this patch tested?

(Please test your changes, and provide instructions on how to test it:

  1. If you add a feature or fix a bug, add a test to cover your changes.
  2. If you fix a flaky test, repeat it for many times to prove it works.)

pegasas avatar Aug 12 '23 12:08 pegasas

try to add feature: https://github.com/apache/incubator-uniffle/issues/1020

pegasas avatar Aug 12 '23 12:08 pegasas

Hi, @zuston , @jerqi , I have a naive question. How to decide on the shuffle strategy (ESS/RSS/Uniffle) of each shuffleId for spark driver/executor? When to change RSS to ESS is our best practise?

You can see the DelegationShuffleManager design, the pass or not depends on the load or other access checker of coordinator determines

Hi, @zuston , I created a DRAFT PR for this issue. In HybridShuffleManager, I created ESS/RSS ShuffleManager for both, Each time would be decided by Coordinator's response to decide if we should use ESS/RSS.

Am I resolving this issue in right way you think?

pegasas avatar Aug 13 '23 17:08 pegasas

Codecov Report

Merging #1136 (0e6bc51) into master (931d6cd) will increase coverage by 0.87%. The diff coverage is 51.54%.

@@             Coverage Diff              @@
##             master    #1136      +/-   ##
============================================
+ Coverage     54.12%   54.99%   +0.87%     
- Complexity     2516     2692     +176     
============================================
  Files           383      378       -5     
  Lines         21618    20988     -630     
  Branches       1841     1994     +153     
============================================
- Hits          11700    11543     -157     
+ Misses         9186     8739     -447     
+ Partials        732      706      -26     
Files Coverage Δ
.../apache/spark/shuffle/HybridRssShuffleManager.java 65.87% <65.87%> (ø)
...org/apache/spark/shuffle/HybridShuffleManager.java 40.60% <40.60%> (ø)

... and 35 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Aug 14 '23 03:08 codecov-commenter

A use of rss or ess needs to be bound to a shuffle, which is determined by registerShuffle .

Feel free to discuss if you want to know more about this. Thanks @pegasas

zuston avatar Aug 15 '23 01:08 zuston

A use of rss or ess needs to be bound to a shuffle, which is determined by registerShuffle .

Feel free to discuss if you want to know more about this. Thanks @pegasas

Thanks @zuston !

I guess if Shufflemaster failed, it will reassign ESS/RSS to an shufflehandle So (ShuffleId, ESS/RSS) should be saved in memory?

Is my assumption right?

pegasas avatar Aug 15 '23 03:08 pegasas

A use of rss or ess needs to be bound to a shuffle, which is determined by registerShuffle . Feel free to discuss if you want to know more about this. Thanks @pegasas

Thanks @zuston !

I guess if Shufflemaster failed, it will reassign ESS/RSS to an shufflehandle So (ShuffleId, ESS/RSS) should be saved in memory?

Is my assumption right?

Yes.

zuston avatar Aug 15 '23 03:08 zuston

Is this ready for review ? @pegasas

zuston avatar Sep 19 '23 07:09 zuston

Is this ready for review ? @pegasas

I am writing an integration test of this case.

Besides,

What's our roadmap of HybridShuffleManager and DelegateShuffleManager? Will HybridShuffleManager will replace delegateShuffleManager eventually?

pegasas avatar Sep 19 '23 10:09 pegasas

What's our roadmap of HybridShuffleManager and DelegateShuffleManager? Will HybridShuffleManager will replace delegateShuffleManager eventually?

I think yes. But in current stage, let's reserve it.

zuston avatar Oct 25 '23 09:10 zuston