dcgm-exporter

The pod and namespace labels are empty in the metrics of some GPUs that are occupied by Pods

Open qingfenghcy opened this issue 6 months ago • 0 comments

What is the version?

3.1.8-3.1.5

What happened?

I have deployed the dcgm-exporter and gpu-nvidia DaemonSets in a k8s cluster, so GPU metrics are now being collected. The cluster has more than 200 nodes. On some nodes the T4 GPU is occupied by a Pod, but the pod and namespace fields in the exported metrics are empty. I compared the node configuration of nodes where the fields are populated against nodes where they are empty and found no difference. The dcgm and gpu-nvidia pod logs also look the same on both kinds of nodes and show nothing abnormal.
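To make the symptom easier to pin down, here is a minimal sketch that scans the text a dcgm-exporter `/metrics` endpoint returns and lists the series whose `pod` or `namespace` label is empty. The sample text, the host names `node-a` and `node-b`, and the helper name `samples_missing_pod` are hypothetical and only illustrate the shape of the data. An idle GPU legitimately has empty labels, so the result is a list of candidates to cross-check against the Pods actually scheduled on each node.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
# One key="value" pair inside the braces (escaped quotes are tolerated)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def samples_missing_pod(metrics_text):
    """Return (metric, Hostname, gpu) for samples with an empty pod or namespace label."""
    missing = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # skip samples without labels
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if not labels.get("pod") or not labels.get("namespace"):
            missing.append(
                (match.group("name"), labels.get("Hostname", ""), labels.get("gpu", ""))
            )
    return missing


# Hypothetical scrape output, shortened for illustration only.
SAMPLE = '''
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a",namespace="default",pod="train-0"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b",namespace="",pod=""} 91
'''

if __name__ == "__main__":
    for metric, host, gpu in samples_missing_pod(SAMPLE):
        print(f"{metric} on {host} gpu {gpu}: pod/namespace label is empty")
```

Running this against the scrape output of each node's exporter gives a list of affected hosts that can then be compared with the nodes where the labels are populated.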

What did you expect to happen?

I would like to understand why this happens and what I should check.

What is the GPU model?

Each machine has a T4 GPU

What is the environment?

A k8s cluster with 200+ bare metal nodes.

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

qingfenghcy, Aug 14 '24 03:08