add xpu monitor for dlrover

Open majieyue opened this issue 1 year ago • 3 comments

Background

DLRover is an elastic deep learning framework that tolerates faults such as process failures and pod loss. Because LLM training runs at large scale and lasts a long time, many errors do not surface as the process failures above; instead the job hangs for a long time. During such a hang, xPU metrics and logs may help detect the error.

Requirement

We need an xPU metrics monitor that runs either inside the elastic agent or as a DaemonSet on each node. The monitor collects xPU metrics such as utilization, memory usage, temperature, tensor core usage, and internal traffic such as NVLink and PCIe. Although there are many xPU vendors on the market, we can start with Nvidia...
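To make the requirement concrete, here is a minimal sketch of what the polling side of such a monitor could look like. It is only an illustration: `XpuMonitor`, `Sample`, the 10-second default interval, and the history size are all hypothetical names and values, not existing DLRover code. The vendor-specific part is passed in as a `collector` callable so that other xPU vendors can be plugged in later.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple

# Hypothetical type: one sample is a list of per-device metric dicts.
Sample = List[Dict[str, float]]


class XpuMonitor:
    """Polls a vendor-specific collector on a fixed interval (illustrative only)."""

    def __init__(self, collector: Callable[[], Sample],
                 interval_s: float = 10.0, history: int = 360) -> None:
        self._collector = collector
        self._interval_s = interval_s
        # Bounded buffer so a long-running job does not grow memory without limit.
        self._history: Deque[Tuple[float, Sample]] = deque(maxlen=history)
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def sample_once(self) -> None:
        """Collect one sample and store it with a wall-clock timestamp."""
        self._history.append((time.time(), self._collector()))

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self.sample_once()
            except Exception as exc:  # keep monitoring even if one read fails
                print(f"xpu monitor: collection failed: {exc}")
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self._interval_s)

    def start(self) -> None:
        if self._thread is not None:
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        if self._thread is None:
            return
        self._stop.set()
        self._thread.join()
        self._thread = None

    def latest(self) -> Optional[Tuple[float, Sample]]:
        """Return the most recent (timestamp, sample) pair, or None if empty."""
        return self._history[-1] if self._history else None
```

The same class could run as a background thread in the elastic agent or as the main loop of a per-node daemon; which of the two fits better is still open in this issue.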

Oct 12 '24 02:10 majieyue

Is there a specific usage document for xpu_timer? Is it strongly dependent on the Dlrover framework or can all training frameworks learn from it?

Oct 18 '24 01:10 aqwertaqwert

Hi @aqwertaqwert, xPU here is a generic term for the GPGPUs on the market; it does not refer to xpu_timer at all :) We recommend starting with Nvidia GPUs, e.g. adding some code to collect metrics from Nvidia DCGM or PyNVML.
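As a rough starting point for the PyNVML route, the sketch below reads utilization, memory usage, and temperature for every GPU on the node. The function name `collect_gpu_metrics` and the output keys are hypothetical; the NVML calls are the ones exposed by the `pynvml` module. The binding is passed in as a parameter (an assumption made here so the function can be exercised without a GPU).

```python
from typing import Any, Dict, List


def collect_gpu_metrics(nvml: Any) -> List[Dict[str, float]]:
    """Read one sample per GPU through an NVML binding such as the pynvml module."""
    nvml.nvmlInit()
    try:
        samples: List[Dict[str, float]] = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            samples.append({
                "gpu_index": float(index),
                "gpu_util_pct": float(util.gpu),
                "mem_util_pct": float(util.memory),
                "mem_used_bytes": float(mem.used),
                "mem_total_bytes": float(mem.total),
                "temperature_c": float(temp),
            })
        return samples
    finally:
        # Always release NVML, even if one of the reads raised.
        nvml.nvmlShutdown()
```

On a node with an Nvidia driver installed, `import pynvml` followed by `collect_gpu_metrics(pynvml)` should return one dict per GPU. Tensor core usage and NVLink/PCIe traffic are not covered by this sketch.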

Nov 05 '24 10:11 majieyue

How about installing the DCGM exporter in the k8s cluster and having DLRover read the metrics from the Prometheus server?
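A minimal sketch of the read side, assuming a Prometheus server that already scrapes the DCGM exporter: it sends an instant query to the standard Prometheus HTTP API (`/api/v1/query`) and flattens the returned vector. The helper names `query_prometheus` and `parse_instant_vector` are hypothetical, and the server URL has to come from the cluster setup.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def parse_instant_vector(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a Prometheus instant-query response into label/value rows."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    rows = []
    for series in data["result"]:
        # Each series carries a [unix_timestamp, "value-as-string"] pair.
        _timestamp, value = series["value"]
        rows.append({"labels": series["metric"], "value": float(value)})
    return rows


def query_prometheus(base_url: str, promql: str,
                     timeout_s: float = 5.0) -> List[Dict[str, Any]]:
    """Run an instant PromQL query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `query_prometheus(base_url, "DCGM_FI_DEV_GPU_UTIL")` would return the current GPU utilization series, where `base_url` is the address of the Prometheus server. Which labels identify the node, GPU, and pod depends on how the exporter is deployed, so that mapping is left out here.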

Jul 07 '25 03:07 zhwentao