Introduce data cleanup mechanism at stage level
### What changes were proposed in this pull request?
Introduce a data cleanup mechanism at the stage level.

### Why are the changes needed?
This PR reduces disk usage on the shuffle servers. For example:
- Some Spark ML jobs run many stages and leave large amounts of shuffle data, no longer needed once a stage finishes, sitting on the shuffle servers until the application ends. A sketch of the intended flow follows below.
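As a rough sketch of that flow, assuming a driver-side hook and the `ShuffleWriteClient` interface (the method shown is an assumption based on this PR's description, not necessarily its exact API): once no remaining stage reads a given shuffle, the driver unregisters it so the shuffle servers can purge its data before the application exits.

```java
// Hypothetical sketch: free a finished stage's shuffle data early instead of
// keeping it on the shuffle servers until the whole application ends.
import org.apache.uniffle.client.api.ShuffleWriteClient;

public class StageLevelCleanupSketch {
  private final ShuffleWriteClient shuffleWriteClient;

  public StageLevelCleanupSketch(ShuffleWriteClient shuffleWriteClient) {
    this.shuffleWriteClient = shuffleWriteClient;
  }

  /** Called when no downstream stage needs the given shuffle anymore. */
  public void onShuffleObsolete(String appId, int shuffleId) {
    // Asks every shuffle server that registered this shuffle to drop its
    // in-memory buffers and on-disk files for that shuffle only.
    shuffleWriteClient.unregisterShuffle(appId, shuffleId);
  }
}
```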
### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UTs.
### Codecov Report
Merging #249 (e24fbbd) into master (fc6d49b) will decrease coverage by 0.94%. The diff coverage is 55.02%.
| Coverage Diff | master | #249 | +/- |
|---|---|---|---|
| Coverage | 60.11% | 59.17% | -0.95% |
| Complexity | 1217 | 1369 | +152 |
| Files | 150 | 166 | +16 |
| Lines | 7602 | 8974 | +1372 |
| Branches | 718 | 853 | +135 |
| Hits | 4570 | 5310 | +740 |
| Misses | 2791 | 3390 | +599 |
| Partials | 241 | 274 | +33 |
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...e/uniffle/client/factory/ShuffleClientFactory.java | 0.00% <0.00%> (ø) | |
| ...pache/uniffle/server/ShuffleServerGrpcService.java | 0.87% <0.00%> (-0.03%) | :arrow_down: |
| ...he/uniffle/server/storage/MultiStorageManager.java | 37.50% <0.00%> (ø) | |
| ...storage/handler/impl/HdfsShuffleDeleteHandler.java | 0.00% <0.00%> (ø) | |
| ...e/storage/handler/impl/LocalFileDeleteHandler.java | 0.00% <0.00%> (ø) | |
| ...he/uniffle/client/impl/ShuffleWriteClientImpl.java | 21.23% <9.67%> (-1.75%) | :arrow_down: |
| .../org/apache/uniffle/server/ShuffleTaskManager.java | 76.80% <78.94%> (-0.16%) | :arrow_down: |
| ...he/uniffle/server/buffer/ShuffleBufferManager.java | 83.33% <86.36%> (+1.17%) | :arrow_up: |
| ...va/org/apache/uniffle/server/event/PurgeEvent.java | 88.88% <88.88%> (ø) | |
| .../java/org/apache/spark/shuffle/RssSparkConfig.java | 96.77% <100.00%> (+0.10%) | :arrow_up: |
| ... and 23 more | | |
PTAL @jerqi. Unregistering is currently scoped to Spark; MR will be supported in follow-up commits.
Why does MR need stage-level shuffle cleanup? MR doesn't have multiple stages.
Oh yes. My fault.
Gentle ping @jerqi. If you have time, could you help review this? I should have some time for quick fixes during the National Day holiday.
I will take a look at this PR ASAP.
One question: why do we add the concept of PurgeEvent? What's the advantage? Could we just add a shuffle-level deletion method to StorageManager?
Two reasons:
- It lets clearResourceThread handle both kinds of events, covering stage-level and app-level data (a sketch of the hierarchy follows below).
- I don't want to introduce an extra delete method in StorageManager, because most of the deletion logic is the same.

Updated.
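For illustration, a minimal sketch of what such an event hierarchy can look like. PurgeEvent appears in the diff; the subclass, field, and accessor names below are assumptions drawn from this thread, not necessarily the PR's exact code.

```java
// One base event lets clearResourceThread drain a single queue that carries
// both app-level and stage-level cleanup work.
import java.util.List;

abstract class PurgeEvent {
  private final String appId;

  PurgeEvent(String appId) {
    this.appId = appId;
  }

  String getAppId() {
    return appId;
  }
}

// App-level purge: drop every resource owned by a finished application.
class AppPurgeEvent extends PurgeEvent {
  AppPurgeEvent(String appId) {
    super(appId);
  }
}

// Stage-level purge: drop only specific shuffles of a still-running application.
class ShufflePurgeEvent extends PurgeEvent {
  private final List<Integer> shuffleIds;

  ShufflePurgeEvent(String appId, List<Integer> shuffleIds) {
    super(appId);
    this.shuffleIds = shuffleIds;
  }

  List<Integer> getShuffleIds() {
    return shuffleIds;
  }
}
```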
Actually, we still add a removeResourcesByShuffleIds method. Let me think twice.
Changelog of the latest commit:
- Extracted config entries for the unregister request timeout (in seconds) and the unregister thread pool size: spark.rss.client.unregister.request.timeout.sec and spark.rss.client.unregister.thread.pool.size. A usage sketch follows below.
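A usage sketch for these entries; the values shown are illustrative, not the defaults.

```java
import org.apache.spark.SparkConf;

public class UnregisterConfSketch {
  public static SparkConf withUnregisterTuning() {
    SparkConf conf = new SparkConf();
    // Per-request timeout, in seconds, for unregister RPCs (illustrative value).
    conf.set("spark.rss.client.unregister.request.timeout.sec", "10");
    // Number of threads sending unregister requests in parallel (illustrative value).
    conf.set("spark.rss.client.unregister.thread.pool.size", "10");
    return conf;
  }
}
```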
Gentle ping @jerqi
Thanks for your review @jerqi