[CELEBORN-1337] Optimize WorkerInfo in WorkerStatusTracker by removing unused fields
What changes were proposed in this pull request?
WorkerSummary is added as parent of WorkerInfo to be used in WorkerStatusTracker to reduce memory footprint
Why are the changes needed?
In our 1000+ nodes cluster, over 500 workers got excluded by master due to high disk usage, then master encountered severe gc pressure with app heartbeat. We found needCheckedWorkerList in HeartbeatFromApplication message can as large as 5M, and WorkerInfo contains lots of information that is not used by LifeCycleManager. The extra information(_diskInfos/_userResourceConsumption) can be removed to reduce message size
Does this PR introduce any user-facing change?
No
How was this patch tested?
In this PR, WorkerSummary is restricted in WorkerStatusTracker Per my perspective, it is suggested to replace all occurrence of WorkerInfo in LifecycleManager if community likes the idea Alternatively, WorkerStatusTracker can just store UniqueId of WorkerInfo, but UniqueId is just a String, the type information is lost
@ErikFang I think WorkerSummary sounds strange maybe WorkerId would be better. And I think it can replace WorkerInfo in most scenarios except the worker reports its disk info to the master and reports usage metrics.
@ErikFang I think
WorkerSummarysounds strange maybeWorkerIdwould be better. And I think it can replaceWorkerInfoin most scenarios except the worker reports its disk info to the master and reports usage metrics.
WorkerId sounds good to me, will update the PR
Codecov Report
Attention: Patch coverage is 49.01961% with 26 lines in your changes are missing coverage. Please review.
Project coverage is 48.83%. Comparing base (
12c3779) to head (0e77774). Report is 12 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| ...a/org/apache/celeborn/common/meta/WorkerInfo.scala | 49.02% | 19 Missing and 7 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2398 +/- ##
==========================================
+ Coverage 48.77% 48.83% +0.06%
==========================================
Files 209 209
Lines 13109 13129 +20
Branches 1134 1136 +2
==========================================
+ Hits 6393 6410 +17
- Misses 6294 6297 +3
Partials 422 422
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@ErikFang Hi you'll need to fix the UTs first.
@FMX all UT passes after applying spotless and rebase
Currently, all workerInfos sent from the Master to the LifecycleManager have already eliminated the user resource consumption data. So, if we also remove diskInfos at the same time, will workerInfos only contain ports?
Currently, all workerInfos sent from the Master to the LifecycleManager have already eliminated the user resource consumption data. So, if we also remove diskInfos at the same time, will workerInfos only contain ports?
ok, will update a new commit to remove diskinfo from HeartbeatFromApplicationResponse
Codecov Report
Attention: Patch coverage is 0% with 58 lines in your changes are missing coverage. Please review.
Project coverage is 39.13%. Comparing base (
fc23800) to head (088ea37). Report is 69 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| ...a/org/apache/celeborn/common/meta/WorkerInfo.scala | 0.00% | 58 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2398 +/- ##
==========================================
- Coverage 48.96% 39.13% -9.83%
==========================================
Files 209 217 +8
Lines 13146 13479 +333
Branches 1135 1179 +44
==========================================
- Hits 6436 5273 -1163
- Misses 6287 7905 +1618
+ Partials 423 301 -122
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Discussed with @RexXiong, there is still concern about adding WorkerId Type, will start a thread in mail list to discuss
I will also open a new PR to remove unused fields without adding WorkId
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.
@RexXiong sorry for the late update, https://github.com/apache/celeborn/pull/2529 is opened as discussed previously
Since there is another PR for this JIRA, close this one.