spark-rapids
[FEA] Try to find some useful Alluxio Metrics and document them
Is your feature request related to a problem? Please describe.
This issue comes from https://github.com/NVIDIA/spark-rapids/pull/7172#discussion_r1036170523. Alluxio has hundreds of metrics; see https://docs.alluxio.io/os/user/stable/en/reference/Metrics-List.html. Some of them are useful for monitoring performance.
I have some initial findings, but they are not enough:
Master.TotalPaths:
The total number of files and directories in the Alluxio namespace.
If this is large, the Alluxio master server will consume more memory.
Master.TotalRpcs:
Throughput of master RPC calls. This metric indicates how busy the master is serving client and worker requests.
This metric can be used to identify if the Alluxio master is running slowly.
Note: For the task-time replacement algorithm, there are no RPCs.
Master.JournalFlushTimer:
The timer statistics of the journal flush.
If IO is a bottleneck, the value will be high.
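A minimal sketch of how the three metrics above could be picked out of a metrics snapshot. It assumes the snapshot uses a Dropwizard-style JSON layout (top-level sections such as `gauges`, `counters`, and `timers`, each keyed by metric name), which is an assumption about what the master's metrics endpoint returns and not something verified here. The helper name `pick_metrics` and all sample values are made up for illustration.

```python
import json

# The three metrics discussed above.
WATCHED = ("Master.TotalPaths", "Master.TotalRpcs", "Master.JournalFlushTimer")


def pick_metrics(snapshot, watched=WATCHED):
    """Return {metric_name: (section, payload)} for the watched metrics.

    `snapshot` is assumed to be a dict of sections ("gauges", "timers", ...),
    each mapping metric names to their payload dicts.
    """
    found = {}
    for section, metrics in snapshot.items():
        if not isinstance(metrics, dict):
            continue  # skip scalar entries such as a "version" string
        for name, payload in metrics.items():
            if name in watched:
                found[name] = (section, payload)
    return found


# Hypothetical snapshot; the numbers are invented, not measured.
SAMPLE = json.loads("""
{
  "version": "4.0.0",
  "gauges": {
    "Master.TotalPaths": {"value": 12345},
    "Master.SomethingElse": {"value": 1}
  },
  "timers": {
    "Master.TotalRpcs": {"count": 900, "mean_rate": 3.5},
    "Master.JournalFlushTimer": {"count": 40, "mean": 0.02}
  }
}
""")

if __name__ == "__main__":
    for name, (section, payload) in sorted(pick_metrics(SAMPLE).items()):
        print(name, section, payload)
```

Matching on the full metric name keeps the filter strict; if the real names carry a host or cluster suffix, a prefix match would be needed instead.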
I tried a cluster with 1 g4dn.xlarge node for the driver and 3 g4dn.xlarge nodes for the workers, but couldn't reproduce the high CPU utilization.
I enabled the CSV sinks with a 5s flush period, and the IO utilization was not high.
Perhaps that is because the number of files in the NDS data source is small.
More investigation is needed to find useful metrics.
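For digging through the CSV sink output, here is a rough sketch that turns successive samples of a cumulative count into a per-second rate, which is one way to see how busy the master was in each flush interval. It assumes each metric file has a `t` column (epoch seconds) and a `count` column, as Dropwizard-style CSV reporters typically write; the column names, the helper `rates_per_second`, and the sample rows are assumptions for illustration, not captured output.

```python
import csv
import io


def rates_per_second(csv_text, column="count"):
    """Compute (timestamp, rate) pairs from cumulative samples.

    Each rate is the increase in `column` divided by the elapsed seconds
    between two consecutive rows.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rates = []
    for prev, cur in zip(rows, rows[1:]):
        dt = int(cur["t"]) - int(prev["t"])
        if dt <= 0:
            continue  # ignore duplicate or out-of-order samples
        delta = float(cur[column]) - float(prev[column])
        rates.append((int(cur["t"]), delta / dt))
    return rates


# Hypothetical samples taken 5 seconds apart, matching the 5s flush period.
SAMPLE_CSV = "t,count\n1669852800,100\n1669852805,150\n1669852810,150\n"

if __name__ == "__main__":
    for ts, rate in rates_per_second(SAMPLE_CSV):
        print(ts, rate)
```

With the sample rows, the first interval works out to 10 calls per second and the second to 0, so an idle master shows up as a flat count.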
This is the same as https://github.com/NVIDIA/spark-rapids/issues/6463
We are using filecache instead of Alluxio, so this should be unassigned from me. I suggest closing it.