
Introduce data cleanup mechanism on stage level

Open zuston opened this issue 3 years ago • 5 comments

What changes were proposed in this pull request?

Introduce data cleanup mechanism on stage level

Why are the changes needed?

This PR optimizes disk capacity usage. For example:

  1. Some Spark ML jobs run many stages and leave large amounts of unused shuffle data on the shuffle servers; stage-level cleanup lets that data be released as soon as a stage no longer needs it (see the sketch below).
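
For illustration, here is a minimal driver-side sketch of what stage-level cleanup enables. The StageCleanupSketch class, the ShuffleClient interface, and the unregisterShuffle(appId, shuffleId) call are placeholders for this explanation, not code from this PR:

```java
// Illustrative sketch only: the class and method names below are not from this PR.
public class StageCleanupSketch {

  // Minimal stand-in for the shuffle write client; the real client API may differ.
  interface ShuffleClient {
    void unregisterShuffle(String appId, int shuffleId);
  }

  private final ShuffleClient client;
  private final String appId;

  StageCleanupSketch(ShuffleClient client, String appId) {
    this.client = client;
    this.appId = appId;
  }

  // Invoked once a stage's shuffle output is no longer needed,
  // e.g. from a driver-side stage-completed hook.
  void onShuffleObsolete(int shuffleId) {
    // Ask the assigned shuffle servers to purge only this shuffle's data,
    // instead of waiting for the whole application to finish.
    client.unregisterShuffle(appId, shuffleId);
  }
}
```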

Does this PR introduce any user-facing change?

No

How was this patch tested?

UTs

zuston avatar Sep 28 '22 07:09 zuston

Codecov Report

Merging #249 (e24fbbd) into master (fc6d49b) will decrease coverage by 0.94%. The diff coverage is 55.02%.

@@             Coverage Diff              @@
##             master     #249      +/-   ##
============================================
- Coverage     60.11%   59.17%   -0.95%     
- Complexity     1217     1369     +152     
============================================
  Files           150      166      +16     
  Lines          7602     8974    +1372     
  Branches        718      853     +135     
============================================
+ Hits           4570     5310     +740     
- Misses         2791     3390     +599     
- Partials        241      274      +33     
Impacted Files Coverage Δ
...e/uniffle/client/factory/ShuffleClientFactory.java 0.00% <0.00%> (ø)
...pache/uniffle/server/ShuffleServerGrpcService.java 0.87% <0.00%> (-0.03%) ↓
...he/uniffle/server/storage/MultiStorageManager.java 37.50% <0.00%> (ø)
...storage/handler/impl/HdfsShuffleDeleteHandler.java 0.00% <0.00%> (ø)
...e/storage/handler/impl/LocalFileDeleteHandler.java 0.00% <0.00%> (ø)
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 21.23% <9.67%> (-1.75%) ↓
.../org/apache/uniffle/server/ShuffleTaskManager.java 76.80% <78.94%> (-0.16%) ↓
...he/uniffle/server/buffer/ShuffleBufferManager.java 83.33% <86.36%> (+1.17%) ↑
...va/org/apache/uniffle/server/event/PurgeEvent.java 88.88% <88.88%> (ø)
.../java/org/apache/spark/shuffle/RssSparkConfig.java 96.77% <100.00%> (+0.10%) ↑
... and 23 more


codecov-commenter avatar Sep 28 '22 08:09 codecov-commenter

PTAL @jerqi. The unregister is currently scoped to Spark; MR will be supported in follow-up commits.

zuston avatar Sep 28 '22 10:09 zuston

PTAL @jerqi. The unregister is currently scoped to Spark; MR will be supported in follow-up commits.

Why would MR need stage-level shuffle cleanup? MR doesn't have multiple stages.

jerqi avatar Sep 28 '22 11:09 jerqi

PTAL @jerqi. The unregister is currently scoped to Spark; MR will be supported in follow-up commits.

Why would MR need stage-level shuffle cleanup? MR doesn't have multiple stages.

Oh yes. My fault.

zuston avatar Sep 28 '22 11:09 zuston

Gentle ping @jerqi. If you have time, could you help review this? I should have some time for quick fixes during the National Day holiday.

zuston avatar Sep 30 '22 15:09 zuston

I will take a look at this PR as soon as possible.

jerqi avatar Oct 08 '22 02:10 jerqi

One question: why do we add the concept of PurgeEvent? What's the advantage? Could we just add a shuffle-level deletion method to StorageManager?

jerqi avatar Oct 08 '22 09:10 jerqi

One question: why do we add the concept of PurgeEvent? What's the advantage? Could we just add a shuffle-level deletion method to StorageManager?

Two reasons:

  1. It lets clearResourceThread handle both kinds of events, cleaning up stage-level as well as app-level data (see the sketch below).
  2. I don't want to introduce an extra delete method in StorageManager, because most of the deletion logic is the same.
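
For context, a rough sketch of that event split (one clearResourceThread consuming two flavors of event). Only PurgeEvent itself is named in this PR; the subclasses and fields below are illustrative assumptions:

```java
import java.util.List;

// Sketch of the purge-event idea: one cleanup thread, two event flavors.
// Only PurgeEvent appears in this PR; the subclasses and fields are illustrative.
abstract class PurgeEvent {
  private final String appId;
  private final String user;

  PurgeEvent(String appId, String user) {
    this.appId = appId;
    this.user = user;
  }

  String getAppId() { return appId; }
  String getUser() { return user; }
}

// Application finished: purge everything the app wrote.
class AppPurgeEvent extends PurgeEvent {
  AppPurgeEvent(String appId, String user) {
    super(appId, user);
  }
}

// Stage finished: purge only the given shuffle ids of the app.
class ShufflePurgeEvent extends PurgeEvent {
  private final List<Integer> shuffleIds;

  ShufflePurgeEvent(String appId, String user, List<Integer> shuffleIds) {
    super(appId, user);
    this.shuffleIds = shuffleIds;
  }

  List<Integer> getShuffleIds() { return shuffleIds; }
}
```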

zuston avatar Oct 08 '22 09:10 zuston

Updated.

zuston avatar Oct 10 '22 06:10 zuston

One question: why do we add the concept of PurgeEvent? What's the advantage? Could we just add a shuffle-level deletion method to StorageManager?

Two reasons:

  1. It lets clearResourceThread handle both kinds of events, cleaning up stage-level as well as app-level data.
  2. I don't want to introduce an extra delete method in StorageManager, because most of the deletion logic is the same.

Actually, we still add a method removeResourcesByShuffleIds. Let me think about it some more.
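
For reference, a hedged sketch of what a shuffle-id-scoped removal might look like; this is illustrative only and not the PR's actual removeResourcesByShuffleIds implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: shuffle-scoped cleanup keyed by (appId, shuffleId),
// in contrast to removing every resource that belongs to the application.
class ShuffleResourceRegistrySketch {
  // appId -> shuffleId -> some in-memory resource handle
  private final Map<String, Map<Integer, Object>> resources = new ConcurrentHashMap<>();

  void removeResourcesByShuffleIds(String appId, List<Integer> shuffleIds) {
    Map<Integer, Object> perApp = resources.get(appId);
    if (perApp == null) {
      return;
    }
    for (Integer shuffleId : shuffleIds) {
      // Release only the finished stage's data; the app's other shuffles stay.
      perApp.remove(shuffleId);
    }
  }
}
```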

jerqi avatar Oct 11 '22 03:10 jerqi

Changelog of the latest commit

  1. Extracted the config entries for the unregister thread pool size and request timeout in seconds (usage sketch below):

spark.rss.client.unregister.request.timeout.sec and spark.rss.client.unregister.thread.pool.size
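
A usage sketch of the two new keys on the Spark side; the values shown are placeholders, not recommended defaults:

```java
import org.apache.spark.SparkConf;

public class UnregisterConfExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        // How long the client waits for unregister requests to complete (placeholder value).
        .set("spark.rss.client.unregister.request.timeout.sec", "10")
        // How many threads send unregister requests in parallel (placeholder value).
        .set("spark.rss.client.unregister.thread.pool.size", "10");
    System.out.println(conf.toDebugString());
  }
}
```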

zuston avatar Oct 11 '22 09:10 zuston

Gentle ping @jerqi

zuston avatar Oct 12 '22 03:10 zuston

Thanks for your review @jerqi

zuston avatar Oct 12 '22 08:10 zuston