[CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned
What changes were proposed in this pull request?
Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.
Why are the changes needed?
Currently Celeborn doesn't publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This count can be useful for monitoring and for tuning forceExitTimeout.
Does this PR introduce any user-facing change?
NO
How was this patch tested?
NA
@s0nskar, how do you use the unreleased shuffle count in production?
@SteNicholas We're not in production yet, but this will help us tune the forceExitTimeout config and see whether the default value works for us. Since we probably won't enable replication for a lot of jobs, we want to avoid losing shuffle data when a worker exits.
@s0nskar, please add the UnreleasedShuffleCount metric to the celeborn-dashboard.json file.
@s0nskar, any update for this metric?
Sorry, I messed up the commit history on this one, so I'm reopening the PR.
@SteNicholas updated the PR with dashboard changes.
@FMX @SteNicholas I'm getting the error below in tests. How can we clean up the metrics cache?
java.lang.IllegalArgumentException: UnreleasedShuffleCount{role="Worker"} is already used for a different type of metric
@s0nskar, you should not invoke addCounter(UNRELEASED_SHUFFLE_COUNT) in WorkerSource.
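For anyone hitting the same error, here is a minimal sketch of why it can occur, assuming the name ends up registered once as a counter and once as another metric type such as a gauge. `TinyRegistry` and its methods are hypothetical stand-ins written for this illustration; they are not Celeborn's `WorkerSource` API, and only the error text is taken from the report above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical illustration only; not Celeborn code.
public class MetricNameCollision {
    enum Kind { COUNTER, GAUGE }

    // A tiny registry that allows one metric type per name.
    static final class TinyRegistry {
        private final Map<String, Kind> kinds = new ConcurrentHashMap<>();
        private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();
        private final Map<String, Supplier<Long>> gauges = new ConcurrentHashMap<>();

        // Reserve the name for one type; reject a second, different type.
        private void claim(String name, Kind kind) {
            Kind previous = kinds.putIfAbsent(name, kind);
            if (previous != null && previous != kind) {
                throw new IllegalArgumentException(
                    name + " is already used for a different type of metric");
            }
        }

        void addCounter(String name) {
            claim(name, Kind.COUNTER);
            counters.putIfAbsent(name, new AtomicLong());
        }

        void addGauge(String name, Supplier<Long> supplier) {
            claim(name, Kind.GAUGE);
            gauges.put(name, supplier);
        }

        long gaugeValue(String name) {
            return gauges.get(name).get();
        }
    }

    static final String UNRELEASED_SHUFFLE_COUNT = "UnreleasedShuffleCount";

    // Registers the name as a counter and then as a gauge; returns true if that fails.
    static boolean registeringBothTypesFails() {
        TinyRegistry registry = new TinyRegistry();
        registry.addCounter(UNRELEASED_SHUFFLE_COUNT);
        try {
            registry.addGauge(UNRELEASED_SHUFFLE_COUNT, () -> 0L);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    // Registers the name as a gauge only and reads the value back.
    static long gaugeOnly(long unreleased) {
        TinyRegistry registry = new TinyRegistry();
        registry.addGauge(UNRELEASED_SHUFFLE_COUNT, () -> unreleased);
        return registry.gaugeValue(UNRELEASED_SHUFFLE_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(registeringBothTypesFails());
        System.out.println(gaugeOnly(3));
    }
}
```

In this sketch, dropping the `addCounter` call removes the conflict, which matches the suggestion above; whether the real metric is a gauge is an assumption here.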
@FMX ping for review.
Merged into main (v0.6.0).