
[CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned

Open s0nskar opened this issue 1 year ago • 8 comments

What changes were proposed in this pull request?

Adds a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.


Why are the changes needed?

Currently, Celeborn doesn't publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This count can be useful for monitoring and for configuring the forceExitTimeout.
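As a rough illustration of the idea, the sketch below tracks shuffle keys in a set and exposes their count as a gauge-style supplier that a metrics system could read at decommission time. The class `UnreleasedShuffleGauge`, its methods, and the key format are hypothetical and are not taken from the Celeborn code base; the actual patch may compute the value differently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch, not Celeborn's implementation.
public class UnreleasedShuffleGauge {
    // Assumed stand-in for the worker's set of shuffle keys that have not been released yet.
    private final Set<String> unreleasedShuffleKeys = ConcurrentHashMap.newKeySet();

    // Record a shuffle key when the worker starts holding data for it.
    public void register(String shuffleKey) {
        unreleasedShuffleKeys.add(shuffleKey);
    }

    // Drop the key once the shuffle has been released.
    public void release(String shuffleKey) {
        unreleasedShuffleKeys.remove(shuffleKey);
    }

    // Gauge-style supplier: evaluated each time the metric is read,
    // so it always reports the current number of unreleased keys.
    public Supplier<Integer> gauge() {
        return unreleasedShuffleKeys::size;
    }
}
```

If the worker exits while this value is still above zero, those shuffles would be lost unless replication is enabled, which is the signal the description proposes for tuning forceExitTimeout.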

Does this PR introduce any user-facing change?

NO

How was this patch tested?

NA

s0nskar avatar Aug 27 '24 15:08 s0nskar

@s0nskar, how do you use the unreleased shuffle count in production?

SteNicholas avatar Aug 27 '24 19:08 SteNicholas

@SteNicholas We're not in production yet, but this will help us tune the forceExitTimeout config and see whether the default value works for us. As we probably won't enable replication for a lot of jobs, we don't want shuffle data to be lost when a worker exits.

s0nskar avatar Aug 28 '24 04:08 s0nskar

@s0nskar, please add the UnreleasedShuffleCount metric to the celeborn-dashboard.json file.

SteNicholas avatar Aug 29 '24 08:08 SteNicholas

@s0nskar, any update on this metric?

SteNicholas avatar Sep 14 '24 06:09 SteNicholas

Sorry, I messed up the commit history on this one, so I'm reopening the PR.

@SteNicholas updated the PR with dashboard changes.

s0nskar avatar Sep 16 '24 06:09 s0nskar

@FMX @SteNicholas I'm getting the error below in tests. How can we clean up the metrics cache?

java.lang.IllegalArgumentException: UnreleasedShuffleCount{role="Worker"} is already used for a different type of metric

s0nskar avatar Sep 23 '24 08:09 s0nskar

@s0nskar, you should not invoke addCounter(UNRELEASED_SHUFFLE_COUNT) in WorkerSource.
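The error message appears to match the one a Dropwizard-style metric registry raises when a name already registered as one metric type is reused for another. The sketch below reproduces that collision with a tiny hypothetical registry (`TinyRegistry` and its methods are illustrative only, not Celeborn's or Dropwizard's API): registering the name as a counter first makes a later gauge registration under the same name fail.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical minimal registry, written only to illustrate the name/type collision.
public class TinyRegistry {
    private final Map<String, Object> metrics = new ConcurrentHashMap<>();

    // Returns the counter registered under this name, creating it if absent.
    public AtomicLong counter(String name) {
        Object existing = metrics.computeIfAbsent(name, k -> new AtomicLong());
        if (!(existing instanceof AtomicLong)) {
            throw new IllegalArgumentException(
                name + " is already used for a different type of metric");
        }
        return (AtomicLong) existing;
    }

    // Registers a gauge; fails if the name already belongs to a non-gauge metric.
    public void gauge(String name, Supplier<Long> supplier) {
        Object existing = metrics.putIfAbsent(name, supplier);
        if (existing != null && !(existing instanceof Supplier)) {
            throw new IllegalArgumentException(
                name + " is already used for a different type of metric");
        }
    }
}
```

Under this reading, the fix is not to clear any cache but to register the name as exactly one metric type, which is consistent with the advice to drop the addCounter call.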

SteNicholas avatar Sep 25 '24 07:09 SteNicholas

@FMX ping for review.

s0nskar avatar Sep 30 '24 04:09 s0nskar

Merged into main (v0.6.0).

FMX avatar Oct 08 '24 09:10 FMX