[CELEBORN-1501] Introduce application dimension resource consumption metrics of Worker

Open SteNicholas opened this issue 1 year ago • 7 comments

What changes were proposed in this pull request?

Introduce application dimension resource consumption metrics of Worker for ResourceConsumptionSource.

Why are the changes needed?

ResourceConsumption namespace metrics are currently generated per user and identified by a metric tag. This change introduces application dimension resource consumption metrics, which expose the resources each application consumes on a Worker. Monitoring resource consumption in the application dimension shows how much each application actually consumes.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

curl http://celeborn-worker:9096/metrics|grep applicationId|grep disk|head -20
metrics_diskFileCount_Value{applicationId="application_1720756171504_197094_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 42 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1720756171504_197094_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 27157332949 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1549139_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 47 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1549139_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 48304559082 1721132143020
metrics_diskFileCount_Value{applicationId="application_1688369676084_19713351_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 20 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1688369676084_19713351_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 13112170199 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1552645_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 45 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1552645_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 35335034306 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1552665_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 59 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1552665_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 47637375731 1721132143020
metrics_diskFileCount_Value{applicationId="application_1720756171504_199529_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 59 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1720756171504_199529_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 54106810966 1721132143020
metrics_diskFileCount_Value{applicationId="application_1720756171504_199536_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 19 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1720756171504_199536_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 9215818606 1721132143020
metrics_diskFileCount_Value{applicationId="application_1650016801129_34416161_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 26 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1650016801129_34416161_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 23650636804 1721132143020
metrics_diskFileCount_Value{applicationId="application_1716712852097_2884119_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 12 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1716712852097_2884119_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 650314937 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1563526_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 16 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1563526_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 1555862722 1721132143020
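To check output like the above programmatically rather than by eye, the scraped lines can be grouped by the applicationId label. The following is a minimal sketch, not part of the patch: the helper names parse_line and per_application are hypothetical, and the regular expressions only cover the simple `name{labels} value timestamp` shape seen in this output, not the full exposition format.

```python
import re
from collections import defaultdict

# Matches lines shaped like the output above:
#   metric_name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+(?P<ts>\d+))?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Return (metric name, labels dict, value) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    labels = dict(LABEL_RE.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))


def per_application(lines):
    """Group metric values by the applicationId label (hypothetical helper)."""
    usage = defaultdict(dict)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip comments, blank lines, and unlabelled metrics
        name, labels, value = parsed
        app = labels.get("applicationId")
        if app is not None:
            usage[app][name] = value
    return dict(usage)


# Two lines copied from the output above.
sample = [
    'metrics_diskFileCount_Value{applicationId="application_1720756171504_197094_1",'
    'hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} '
    '42 1721132143020',
    'metrics_diskBytesWritten_Value{applicationId="application_1720756171504_197094_1",'
    'hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} '
    '27157332949 1721132143020',
]

usage = per_application(sample)
```

Each application then maps to one dictionary of its disk metrics, which makes it easy to confirm that every applicationId reports both diskFileCount and diskBytesWritten.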

SteNicholas avatar Jul 16 '24 12:07 SteNicholas

Ping @leixm, @RexXiong.

SteNicholas avatar Jul 16 '24 12:07 SteNicholas

Ping @FMX, @RexXiong, @AngersZhuuuu.

SteNicholas avatar Jul 18 '24 12:07 SteNicholas

LGTM.

leixm avatar Jul 19 '24 02:07 leixm

Ping @RexXiong, @FMX.

SteNicholas avatar Jul 19 '24 03:07 SteNicholas

You can come up with a new Jira ticket. CELEBORN-1292 focuses on removing application dimension resource consumption.

FMX avatar Jul 26 '24 11:07 FMX

@FMX, I have updated commit message and pull request title with CELEBORN-1501. PTAL.

SteNicholas avatar Jul 29 '24 03:07 SteNicholas

Too many applications may push the number of metrics past the default capacity of 4096, which makes the metrics unwieldy. To address this, we should implement a limit so that the number of metrics stays manageable and does not exceed a reasonable threshold. For these application metrics, only the top N are valuable.

RexXiong avatar Jul 31 '24 06:07 RexXiong
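The top-N idea in the comment above can be illustrated with a short sketch. This is not the code in the PR: the function name top_n_applications is hypothetical, and ranking by disk bytes written is an assumption, since the thread does not say which field orders the applications. The byte counts are taken from the test output in the description.

```python
import heapq


def top_n_applications(bytes_written_by_app, limit):
    """Return at most `limit` (applicationId, bytes) pairs, largest first.

    Assumption: applications are ranked by disk bytes written.
    """
    if limit <= 0:
        return []
    return heapq.nlargest(limit, bytes_written_by_app.items(), key=lambda kv: kv[1])


# Disk bytes written per application, copied from the description's output.
consumption = {
    "application_1720756171504_197094_1": 27157332949,
    "application_1720756171504_199529_1": 54106810966,
    "application_1650016801129_34416161_1": 23650636804,
    "application_1716712852097_2884119_1": 650314937,
}

top_two = top_n_applications(consumption, 2)
```

Capping the result at a fixed limit bounds the number of per-application metric series regardless of how many applications a Worker serves, which is the concern raised about the 4096 capacity.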

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Aug 20 '24 08:08 github-actions[bot]

@RexXiong, I have added celeborn.metrics.resourceConsumption.app.limit to prevent the total number of metrics from exceeding the metrics capacity. PTAL. cc @FMX.

SteNicholas avatar Aug 22 '24 07:08 SteNicholas

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 33.25%. Comparing base (ea6617c) to head (1698b56). Report is 45 commits behind head on main.

Files | Patch % | Lines
...cala/org/apache/celeborn/common/CelebornConf.scala | 85.72% | 1 Missing
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2630      +/-   ##
==========================================
- Coverage   39.83%   33.25%   -6.58%     
==========================================
  Files         239      313      +74     
  Lines       15026    18282    +3256     
  Branches     1362     1678     +316     
==========================================
+ Hits         5984     6077      +93     
- Misses       8711    11865    +3154     
- Partials      331      340       +9     

codecov[bot] avatar Aug 27 '24 07:08 codecov[bot]

@FMX, @RexXiong, I have updated the implementation for TopN application dimension resource consumption metrics of Worker. PTAL.

SteNicholas avatar Sep 03 '24 08:09 SteNicholas