spark-rapids [FEA] Try to find some useful Alluxio Metrics and document them

Open res-life opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.

This issue follows from https://github.com/NVIDIA/spark-rapids/pull/7172#discussion_r1036170523. Alluxio exposes hundreds of metrics; see https://docs.alluxio.io/os/user/stable/en/reference/Metrics-List.html. Some of them are useful for monitoring performance.

I have some experience with the following metrics, but it is not enough:

Master.TotalPaths: 
The total number of files and directories in the Alluxio namespace.
If this is large, the Alluxio master server will consume more memory.

Master.TotalRpcs: 
Throughput of master RPC calls. This metric indicates how busy the master is serving client and worker requests.
This metric can be used to identify if the Alluxio master is running slowly.
Note: For the task-time replacement algorithm, there are no RPCs.

Master.JournalFlushTimer: 
Timer statistics for journal flushes.
If IO is a bottleneck, the value will be high.
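
As a minimal sketch of how the three metrics above could be pulled out of a metrics snapshot, the following assumes the snapshot is a Dropwizard-style JSON document with top-level sections such as `gauges` and `timers` (for example, one saved from the master's metrics JSON endpoint). The section layout, the `pick_metrics` helper, and all numbers in `SAMPLE` are assumptions for illustration, not measurements from this issue.

```python
import json

# Metric names taken from the list above.
WATCHED = ("Master.TotalPaths", "Master.TotalRpcs", "Master.JournalFlushTimer")

# Assumed Dropwizard-style section names; adjust to the real payload if it differs.
SECTIONS = ("gauges", "counters", "meters", "timers", "histograms")


def pick_metrics(snapshot, names=WATCHED):
    """Return {metric name: fields} for each watched metric found in any section."""
    found = {}
    for section in SECTIONS:
        for name, fields in snapshot.get(section, {}).items():
            if name in names:
                found[name] = fields
    return found


# Hypothetical payload with made-up values, for illustration only.
SAMPLE = json.loads("""
{
  "gauges": {
    "Master.TotalPaths": {"value": 1200},
    "Master.SomethingElse": {"value": 1}
  },
  "timers": {
    "Master.TotalRpcs": {"count": 340},
    "Master.JournalFlushTimer": {"count": 25, "mean": 3.5}
  }
}
""")

if __name__ == "__main__":
    for name, fields in pick_metrics(SAMPLE).items():
        print(name, fields)
```

Scanning every section instead of hard-coding one avoids guessing whether a given metric is reported as a gauge, counter, or timer.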

I tried a cluster with one g4dn.xlarge node for the driver and three g4dn.xlarge nodes for the workers, but could not reproduce the high CPU utilization.

I enabled the CSV sinks with a 5s flush period, and the IO utilization is not high.
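
A rough sketch of how the CSV sink output could be turned into a per-second rate is shown below. It assumes each metric is written to its own CSV file with a header row, a `t` column holding epoch seconds, and a `count` column; that layout, the `count_rates` helper, and the sample rows are assumptions for illustration, not data collected for this issue.

```python
import csv
import io


def count_rates(csv_text, time_col="t", count_col="count"):
    """Return [(timestamp, events per second)] between consecutive samples."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rates = []
    for prev, cur in zip(rows, rows[1:]):
        dt = float(cur[time_col]) - float(prev[time_col])
        if dt <= 0:
            # Skip duplicate or out-of-order samples.
            continue
        dc = float(cur[count_col]) - float(prev[count_col])
        rates.append((int(float(cur[time_col])), dc / dt))
    return rates


# Hypothetical samples spaced 5 seconds apart, matching a 5s flush period.
SAMPLE_CSV = """t,count
1670234000,100
1670234005,150
1670234010,150
"""

if __name__ == "__main__":
    for ts, rate in count_rates(SAMPLE_CSV):
        print(ts, rate)
```

A rate that stays at zero across samples would be consistent with a workload that issues no master RPCs during that interval.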

Perhaps this is because the number of files in the NDS data source is small.

More investigation is needed to find useful metrics.

Dec 05 '22 10:12 res-life

This is the same as https://github.com/NVIDIA/spark-rapids/issues/6463.

Dec 05 '22 14:12 tgravescs

We are now using filecache instead of Alluxio, so I am unassigning myself. I suggest closing this issue.

Mar 15 '24 01:03 res-life
