[CELEBORN-1501] Introduce application dimension resource consumption metrics of Worker
What changes were proposed in this pull request?
Introduce application dimension resource consumption metrics of Worker for ResourceConsumptionSource.
Why are the changes needed?
ResourceConsumption namespace metrics are currently generated per user and identified by a metric tag. This PR adds application dimension resource consumption metrics for the Worker, so that the actual resource consumption of each application can be monitored.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
curl http://celeborn-worker:9096/metrics | grep applicationId | grep disk | head -20
metrics_diskFileCount_Value{applicationId="application_1720756171504_197094_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 42 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1720756171504_197094_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 27157332949 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1549139_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 47 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1549139_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 48304559082 1721132143020
metrics_diskFileCount_Value{applicationId="application_1688369676084_19713351_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 20 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1688369676084_19713351_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 13112170199 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1552645_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 45 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1552645_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 35335034306 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1552665_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 59 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1552665_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 47637375731 1721132143020
metrics_diskFileCount_Value{applicationId="application_1720756171504_199529_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 59 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1720756171504_199529_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 54106810966 1721132143020
metrics_diskFileCount_Value{applicationId="application_1720756171504_199536_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 19 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1720756171504_199536_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 9215818606 1721132143020
metrics_diskFileCount_Value{applicationId="application_1650016801129_34416161_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 26 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1650016801129_34416161_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 23650636804 1721132143020
metrics_diskFileCount_Value{applicationId="application_1716712852097_2884119_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 12 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1716712852097_2884119_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 650314937 1721132143020
metrics_diskFileCount_Value{applicationId="application_1718714878734_1563526_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 16 1721132143020
metrics_diskBytesWritten_Value{applicationId="application_1718714878734_1563526_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 1555862722 1721132143020
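For anyone who wants to inspect this output per application rather than by eye, the following is a minimal Python sketch that groups the lines above by their `applicationId` label. It is not part of this patch: the helper names `parse_line` and `per_application` are hypothetical, and the regular expressions assume only the line shape shown in the output above (metric name, label set in braces, value, optional millisecond timestamp).

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the output above:
#   <metric_name>{key="value",...} <value> [<timestamp_ms>]
LINE_RE = re.compile(
    r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+(?P<ts>\d+))?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Return (metric_name, labels, value) or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def per_application(lines):
    """Group metric values by the applicationId label (hypothetical helper)."""
    grouped = defaultdict(dict)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip comments, blank lines, and unrelated output
        name, labels, value = parsed
        app_id = labels.get("applicationId")
        if app_id is not None:
            grouped[app_id][name] = value
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        'metrics_diskFileCount_Value{applicationId="application_1720756171504_197094_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 42 1721132143020',
        'metrics_diskBytesWritten_Value{applicationId="application_1720756171504_197094_1",hostName="celeborn-worker",name="default",role="Worker",tenantId="default"} 27157332949 1721132143020',
    ]
    for app_id, metrics in per_application(sample).items():
        print(app_id, metrics)
```

Feeding the `curl` output into `per_application` line by line gives one dictionary entry per application, which makes it easier to check that each application reports both a file count and a bytes-written value.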
Ping @leixm, @RexXiong.
Ping @FMX, @RexXiong, @AngersZhuuuu.
LGTM.
Ping @RexXiong, @FMX.
You can come up with a new Jira ticket. CELEBORN-1292 focuses on removing application dimension resource consumption.
@FMX, I have updated the commit message and pull request title with CELEBORN-1501. PTAL.
Too many applications may push the number of metrics past the default capacity of 4096 and make the metrics unwieldy. We should add a limit so that the number of metrics stays within a reasonable threshold. Also, for these application metrics, only the TOP[XX] are valuable.
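As a rough illustration of the limit described in this comment, here is a minimal Python sketch that keeps only the top N applications before any metrics would be registered. It is not the Worker's implementation: the function name `top_n_applications`, the dictionary layout, and the choice of `diskBytesWritten` as the ranking key are all assumptions made for the example.

```python
import heapq


def top_n_applications(consumption, limit, key="diskBytesWritten"):
    """Return the `limit` applications with the largest `key` value.

    `consumption` maps an application id to a dict of resource counters.
    Both the layout and the default ranking key are assumptions for this
    sketch, not something taken from the Celeborn code base.
    """
    if limit <= 0:
        return []
    # heapq.nlargest returns the entries sorted from largest to smallest.
    return heapq.nlargest(
        limit, consumption.items(), key=lambda item: item[1].get(key, 0)
    )


if __name__ == "__main__":
    # Values copied from the test output in the PR description.
    consumption = {
        "application_1720756171504_199529_1": {"diskBytesWritten": 54106810966, "diskFileCount": 59},
        "application_1720756171504_197094_1": {"diskBytesWritten": 27157332949, "diskFileCount": 42},
        "application_1716712852097_2884119_1": {"diskBytesWritten": 650314937, "diskFileCount": 12},
    }
    for app_id, counters in top_n_applications(consumption, limit=2):
        print(app_id, counters)
```

With a cap like this, the number of application-level gauges grows with the limit rather than with the number of running applications, which is the property the comment asks for.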
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.
@RexXiong, I have added celeborn.metrics.resourceConsumption.app.limit to prevent the total number of metrics from exceeding the metrics capacity. PTAL. cc @FMX.
Codecov Report
Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
Project coverage is 33.25%. Comparing base (ea6617c) to head (1698b56). Report is 45 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| ...cala/org/apache/celeborn/common/CelebornConf.scala | 85.72% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2630 +/- ##
==========================================
- Coverage 39.83% 33.25% -6.58%
==========================================
Files 239 313 +74
Lines 15026 18282 +3256
Branches 1362 1678 +316
==========================================
+ Hits 5984 6077 +93
- Misses 8711 11865 +3154
- Partials 331 340 +9
@FMX, @RexXiong, I have updated the implementation for TopN application dimension resource consumption metrics of Worker. PTAL.