celeborn icon indicating copy to clipboard operation
celeborn copied to clipboard

[CELEBORN-1337] Optimize WorkerInfo in WorkerStatusTracker by removing unused fields

Open ErikFang opened this issue 1 year ago • 11 comments

What changes were proposed in this pull request?

WorkerSummary is added as parent of WorkerInfo to be used in WorkerStatusTracker to reduce memory footprint

Why are the changes needed?

In our 1000+ nodes cluster, over 500 workers got excluded by master due to high disk usage, then master encountered severe gc pressure with app heartbeat. We found needCheckedWorkerList in HeartbeatFromApplication message can as large as 5M, and WorkerInfo contains lots of information that is not used by LifeCycleManager. The extra information(_diskInfos/_userResourceConsumption) can be removed to reduce message size

Does this PR introduce any user-facing change?

No

How was this patch tested?

ErikFang avatar Mar 17 '24 15:03 ErikFang

In this PR, WorkerSummary is restricted in WorkerStatusTracker Per my perspective, it is suggested to replace all occurrence of WorkerInfo in LifecycleManager if community likes the idea Alternatively, WorkerStatusTracker can just store UniqueId of WorkerInfo, but UniqueId is just a String, the type information is lost

ErikFang avatar Mar 17 '24 15:03 ErikFang

@ErikFang I think WorkerSummary sounds strange maybe WorkerId would be better. And I think it can replace WorkerInfo in most scenarios except the worker reports its disk info to the master and reports usage metrics.

FMX avatar Mar 18 '24 08:03 FMX

@ErikFang I think WorkerSummary sounds strange maybe WorkerId would be better. And I think it can replace WorkerInfo in most scenarios except the worker reports its disk info to the master and reports usage metrics.

WorkerId sounds good to me, will update the PR

ErikFang avatar Mar 18 '24 16:03 ErikFang

Codecov Report

Attention: Patch coverage is 49.01961% with 26 lines in your changes are missing coverage. Please review.

Project coverage is 48.83%. Comparing base (12c3779) to head (0e77774). Report is 12 commits behind head on main.

Files Patch % Lines
...a/org/apache/celeborn/common/meta/WorkerInfo.scala 49.02% 19 Missing and 7 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2398      +/-   ##
==========================================
+ Coverage   48.77%   48.83%   +0.06%     
==========================================
  Files         209      209              
  Lines       13109    13129      +20     
  Branches     1134     1136       +2     
==========================================
+ Hits         6393     6410      +17     
- Misses       6294     6297       +3     
  Partials      422      422              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 19 '24 16:03 codecov[bot]

@ErikFang Hi you'll need to fix the UTs first.

FMX avatar Mar 26 '24 03:03 FMX

@FMX all UT passes after applying spotless and rebase

ErikFang avatar Mar 31 '24 03:03 ErikFang

Currently, all workerInfos sent from the Master to the LifecycleManager have already eliminated the user resource consumption data. So, if we also remove diskInfos at the same time, will workerInfos only contain ports?

RexXiong avatar Apr 01 '24 06:04 RexXiong

Currently, all workerInfos sent from the Master to the LifecycleManager have already eliminated the user resource consumption data. So, if we also remove diskInfos at the same time, will workerInfos only contain ports?

ok, will update a new commit to remove diskinfo from HeartbeatFromApplicationResponse

ErikFang avatar Apr 01 '24 16:04 ErikFang

Codecov Report

Attention: Patch coverage is 0% with 58 lines in your changes are missing coverage. Please review.

Project coverage is 39.13%. Comparing base (fc23800) to head (088ea37). Report is 69 commits behind head on main.

Files Patch % Lines
...a/org/apache/celeborn/common/meta/WorkerInfo.scala 0.00% 58 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2398      +/-   ##
==========================================
- Coverage   48.96%   39.13%   -9.83%     
==========================================
  Files         209      217       +8     
  Lines       13146    13479     +333     
  Branches     1135     1179      +44     
==========================================
- Hits         6436     5273    -1163     
- Misses       6287     7905    +1618     
+ Partials      423      301     -122     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Apr 06 '24 15:04 codecov-commenter

Discussed with @RexXiong, there is still concern about adding WorkerId Type, will start a thread in mail list to discuss

I will also open a new PR to remove unused fields without adding WorkId

ErikFang avatar Apr 10 '24 09:04 ErikFang

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar May 15 '24 08:05 github-actions[bot]

@RexXiong sorry for the late update, https://github.com/apache/celeborn/pull/2529 is opened as discussed previously

ErikFang avatar May 25 '24 14:05 ErikFang

Since there is another PR for this JIRA, close this one.

FMX avatar Jun 07 '24 08:06 FMX