incubator-uniffle icon indicating copy to clipboard operation
incubator-uniffle copied to clipboard

[#2111] improvement(client): cleanup the useless shuffle server clients when unregister shuffle

Open xianjingfeng opened this issue 1 year ago • 6 comments

What changes were proposed in this pull request?

Cleanup useless the shuffle server clients when unregister shuffle

Why are the changes needed?

Fix: #2111

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT

xianjingfeng avatar Sep 11 '24 07:09 xianjingfeng

Test Results

 3 034 files  +  152   3 034 suites  +152   6h 43m 35s ⏱️ + 46m 53s  1 179 tests +  160   1 178 ✅ +  160   1 💤 ±0  0 ❌ ±0  14 937 runs  +2 069  14 922 ✅ +2 069  15 💤 ±0  0 ❌ ±0 

Results for commit 8729ecdb. ± Comparison against base commit d170004a.

This pull request removes 30 and adds 190 tests. Note that renamed tests count towards both.
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile1{String, File}[1]
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile1{String, File}[2]
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile2{String, File}[1]
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile2{String, File}[2]
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile3{String, File}[1]
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile3{String, File}[2]
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile4{String, File}[1]
org.apache.uniffle.common.records.RecordsReaderWriterTest ‑ testWriteAndReadRecordFile4{String, File}[2]
org.apache.uniffle.common.serializer.PartialInputStreamTest ‑ testReadFileInputStream
org.apache.uniffle.common.serializer.PartialInputStreamTest ‑ testReadMemroyInputStream
…
org.apache.hadoop.mapred.SortWriteBufferManagerTest ‑ testWriteNormalWithRemoteMerge
org.apache.hadoop.mapred.SortWriteBufferManagerTest ‑ testWriteNormalWithRemoteMergeAndCombine
org.apache.hadoop.mapreduce.task.reduce.RMRssShuffleTest ‑ testReadShuffleWithCombine
org.apache.hadoop.mapreduce.task.reduce.RMRssShuffleTest ‑ testReadShuffleWithoutCombine
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testDefaultIncludeExcludeProperties
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testExcludeProperties
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testIncludeProperties
org.apache.spark.shuffle.handle.MutableShuffleHandleInfoTest ‑ testUpdateAssignmentOnPartitionSplit
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.RMRssShuffleTest ‑ testReadMultiPartitionShuffleData
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.RMRssShuffleTest ‑ testReadShuffleData
…

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Sep 11 '24 08:09 github-actions[bot]

What will it happen if we don't have this patch?

jerqi avatar Sep 13 '24 10:09 jerqi

What will it happen if we don't have this patch?

If there are many stages in a application, the client will keep the clients for all shuffle servers, which will waste memory.

xianjingfeng avatar Sep 14 '24 01:09 xianjingfeng

What will it happen if we don't have this patch?

If there are many stages in a application, the client will keep the clients for all shuffle servers, which will waste memory.

One server may be used by multiple stages. We can't remove it if only one stage is finished.

jerqi avatar Sep 14 '24 03:09 jerqi

What will it happen if we don't have this patch?

If there are many stages in a application, the client will keep the clients for all shuffle servers, which will waste memory.

One server may be used by multiple stages. We can't remove it if only one stage is finished.

Right, not all the shuffle servers will be deleted in this PR

xianjingfeng avatar Sep 14 '24 09:09 xianjingfeng

I hope the closing way should be optional,

+1

which can be implemented by the pluggable closable policies.

What are the possible policies?

xianjingfeng avatar Sep 19 '24 03:09 xianjingfeng

@zuston gently ping.

jerqi avatar Dec 16 '24 03:12 jerqi

@zuston gently ping.

xianjingfeng avatar Apr 09 '25 08:04 xianjingfeng

Sorry the late review and I think i missed this thread. @xianjingfeng

zuston avatar Apr 09 '25 13:04 zuston

How about get the client through a guava cache with expire time setting, maybe it's more common without caring about whether the corresponding server is useless or availeble

This solution will have problems if the client is only retrieved from the cache once when it is used in the future.

xianjingfeng avatar Apr 14 '25 06:04 xianjingfeng

I think I should close this PR. Because I find ShuffleServerClientFactory#getShuffleServerClient needs to be optimized more than it. #2441

xianjingfeng avatar Apr 14 '25 07:04 xianjingfeng