Add xPU monitor for DLRover
Background
DLRover is an elastic deep learning framework with fault tolerance for process failures, pod loss, etc. Because LLM training runs at large scale and usually lasts a long time, many errors show up not as the process failures above but as long hangs. During a hang, xPU metrics and logs may help detect such errors.
Requirement
We need an xPU metrics monitor that runs either inside the elastic agent or as a DaemonSet on each node. The monitor collects xPU metrics such as utilization, memory usage, temperature, tensor core usage, and internal traffic such as NVLink and PCIe. Although there are many xPU vendors on the market, we can start with NVIDIA...
Is there a specific usage document for xpu_timer? Does it depend strongly on the DLRover framework, or can all training frameworks learn from it?
Hi @aqwertaqwert, xPU here is an umbrella term for the GPGPUs on the market; it has nothing to do with xpu_timer :) We recommend starting with NVIDIA GPUs, e.g. adding some code to collect metrics from NVIDIA DCGM or PyNVML.
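To make the PyNVML suggestion concrete, here is a minimal sketch of what a per-node collector might look like. It only uses PyNVML calls that exist today (`nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetUtilizationRates`, `nvmlDeviceGetMemoryInfo`, `nvmlDeviceGetTemperature`). The names `GpuSample`, `collect_gpu_samples` and `print_samples_once` are hypothetical and are not part of DLRover. Tensor core usage and NVLink/PCIe traffic are not covered here.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical structure, not a DLRover type)."""
    index: int
    util_percent: int
    mem_used_mib: float
    mem_total_mib: float
    temperature_c: int


def collect_gpu_samples(nvml: Any) -> List[GpuSample]:
    """Read one sample per visible GPU.

    `nvml` is the already-initialized pynvml module. It is passed in as an
    argument (an assumption of this sketch) so the function can be exercised
    with a stand-in object on machines without a GPU.
    """
    samples = []
    for i in range(nvml.nvmlDeviceGetCount()):
        handle = nvml.nvmlDeviceGetHandleByIndex(i)
        util = nvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
        mem = nvml.nvmlDeviceGetMemoryInfo(handle)         # .used/.total in bytes
        temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
        samples.append(
            GpuSample(
                index=i,
                util_percent=util.gpu,
                mem_used_mib=mem.used / 2**20,
                mem_total_mib=mem.total / 2**20,
                temperature_c=temp,
            )
        )
    return samples


def print_samples_once() -> None:
    """Take a single reading on a real GPU node (requires pynvml installed)."""
    import pynvml  # imported lazily so the module loads without a GPU stack

    pynvml.nvmlInit()
    try:
        for sample in collect_gpu_samples(pynvml):
            print(sample)
    finally:
        pynvml.nvmlShutdown()
```

On a GPU node, calling `print_samples_once()` prints one `GpuSample` per device. An agent or DaemonSet would presumably call `collect_gpu_samples` on a timer and report the results; how those results feed into hang detection is left open here.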
How about installing the DCGM exporter in the k8s cluster and having DLRover read the metrics from the Prometheus server?
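As a rough illustration of the Prometheus route, the sketch below runs an instant query against the Prometheus HTTP API (`/api/v1/query`) for `DCGM_FI_DEV_GPU_UTIL`, a utilization metric exported by the DCGM exporter. The function names are hypothetical. The label names `Hostname` and `gpu` are assumptions about how the exporter is configured and may differ per deployment.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, Tuple


def parse_instant_vector(payload: dict) -> Dict[Tuple[str, str], float]:
    """Turn a Prometheus instant-query response into {(host, gpu): value}.

    The "Hostname" and "gpu" label names are assumptions of this sketch;
    series that lack them fall back to "unknown".
    """
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    readings = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        key = (labels.get("Hostname", "unknown"), labels.get("gpu", "unknown"))
        _timestamp, value = series["value"]  # value arrives as a string
        readings[key] = float(value)
    return readings


def query_gpu_util(base_url: str, timeout: float = 5.0) -> Dict[Tuple[str, str], float]:
    """Fetch the current GPU utilization for every series Prometheus knows about.

    `base_url` is the address of the Prometheus server,
    e.g. "http://prometheus:9090" (a placeholder, not a real endpoint).
    """
    query = urllib.parse.urlencode({"query": "DCGM_FI_DEV_GPU_UTIL"})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Compared with an in-agent collector, this keeps GPU access out of the training pods, but the freshness of the data is bounded by the Prometheus scrape interval, which may matter when trying to spot a hang quickly.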