Add xPU monitor for DLRover

Open majieyue opened this issue 1 year ago • 3 comments

Background

DLRover is an elastic deep learning framework with fault tolerance for process failures, Pod loss, etc. Because LLM training runs at large scale and for a long time, many errors do not show up as the process failures above but as long hangs. During a hang, xPU metrics and logs may help detect such errors.

Requirement

We need an xPU metrics monitor that runs either inside the elastic agent or as a DaemonSet on each node. The monitor collects xPU metrics such as utilization, memory usage, temperature, tensor core usage, and internal traffic such as NVLink and PCIe. Although there are many xPU vendors on the market, we can start with Nvidia...

majieyue avatar Oct 12 '24 02:10 majieyue

Is there a specific usage document for xpu_timer? Is it strongly dependent on the DLRover framework, or can all training frameworks learn from it?

aqwertaqwert avatar Oct 18 '24 01:10 aqwertaqwert

Hi @aqwertaqwert, xPU here is an umbrella term for the GPGPUs on the market; it has nothing to do with xpu_timer :) We recommend starting with Nvidia GPUs, e.g. adding some code to collect metrics from Nvidia DCGM or PyNVML.
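
For illustration, the PyNVML route could look roughly like the sketch below. It is not DLRover code: `GpuSample` and `collect_gpu_samples` are hypothetical names, and the optional `nvml` argument exists only so the function can be exercised without a GPU. The `nvml*` calls are the ones exposed by the `pynvml` bindings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical container, not a DLRover type)."""
    index: int
    gpu_util_pct: int
    mem_used_bytes: int
    mem_total_bytes: int
    temperature_c: int


def collect_gpu_samples(nvml=None) -> List[GpuSample]:
    """Read one sample per visible Nvidia GPU through NVML."""
    if nvml is None:
        # Imported lazily so this module still loads on hosts without pynvml.
        import pynvml as nvml

    nvml.nvmlInit()
    try:
        samples = []
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)         # .used / .total in bytes
            temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            samples.append(GpuSample(i, util.gpu, mem.used, mem.total, temp))
        return samples
    finally:
        # Always release NVML, even if one of the queries above raised.
        nvml.nvmlShutdown()
```

A monitor in the elastic agent or a DaemonSet could call `collect_gpu_samples()` on a timer and ship the samples wherever hang detection runs; tensor core, NVLink and PCIe counters are left out of this sketch.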

majieyue avatar Nov 05 '24 10:11 majieyue

How about installing the DCGM exporter in the k8s cluster and having DLRover read the metrics from the Prometheus server?
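
As a rough sketch of that idea, the reader side could issue an instant query against the Prometheus HTTP API (`/api/v1/query`) and parse the returned vector. The server address, the function names, and the choice to key results by the `instance` and `gpu` labels are assumptions for illustration; `DCGM_FI_DEV_GPU_UTIL` is the GPU utilization metric published by the DCGM exporter.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, Tuple

# Assumed address; replace with the Prometheus service in your cluster.
PROMETHEUS_URL = "http://prometheus.example:9090"


def parse_instant_vector(payload: dict) -> Dict[Tuple[str, str], float]:
    """Map (instance label, gpu label) to the latest value of each series."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    result = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        key = (labels.get("instance", ""), labels.get("gpu", ""))
        _timestamp, value = series["value"]  # value arrives as a string
        result[key] = float(value)
    return result


def query_gpu_util(base_url: str = PROMETHEUS_URL,
                   timeout: float = 5.0) -> Dict[Tuple[str, str], float]:
    """Fetch the current GPU utilization of every GPU known to Prometheus."""
    params = urllib.parse.urlencode({"query": "DCGM_FI_DEV_GPU_UTIL"})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

One trade-off to keep in mind: values read this way are only as fresh as the Prometheus scrape interval, so a hang would be noticed later than with in-process collection.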

zhwentao avatar Jul 07 '25 03:07 zhwentao