dcgm-exporter
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
### What is the version? dcgm-exporter version: 3.3.8-3.6.0 DCGM version: 3.3.8 Driver version: 550.127.08 CUDA version: 12.4 ### What happened? When an XID error occurs on the GPU, the **xid_errors**...
### What is the version? nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubi9 ### What happened? nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubi9 ships a vulnerable **kubelet** version ### What did you expect to happen? please update the kubelet pkg to its latest patched version...
### Ask your question I was able to build the binary locally without the CUDA library. Of course, you would need dependencies like these: https://github.com/NVIDIA/dcgm-exporter/blob/56e30d623a023b12e9a2c14367c3b11a8b75693c/docker/Dockerfile.ubuntu#L73-L74 But then what do you need...
### What is the version? 4.1.1-4.0.4 ### What happened? When I enable MIG on the GPU host, I noticed that some metrics are converted incorrectly and are incompatible with...
Solves issue: https://github.com/NVIDIA/dcgm-exporter/issues/488 When I enable MIG on the GPU host, I noticed that some metrics are converted incorrectly and are incompatible with Prometheus. Examples include: DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR DCGM_FI_DEV_MIG_MODE DCGM_FI_DEV_SM_CLOCK
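One way to see why such fields break scraping: a Prometheus sample value must parse as a float, so any DCGM field exported with a string payload (e.g. a MIG device identifier) is invalid exposition. A minimal sketch that flags such lines; the sample text below is illustrative, not actual exporter output, and this is a simplified check (it does not handle spaces inside quoted label values):

```python
# Sketch: flag Prometheus exposition lines whose trailing sample value
# is not a parseable float. String-valued DCGM fields cannot be
# represented as Prometheus samples. Example lines are illustrative only.

def invalid_samples(exposition: str) -> list[str]:
    """Return exposition lines whose last token is not a float."""
    bad = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        value = line.rsplit(None, 1)[-1]  # last whitespace-separated token
        try:
            float(value)
        except ValueError:
            bad.append(line)
    return bad

sample = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
DCGM_FI_DEV_SM_CLOCK{gpu="0"} 1410
DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR{gpu="0"} MIG-abc123
"""

print(invalid_samples(sample))
```

Running this flags only the string-valued line, which is the shape of breakage reported above.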
### Ask your question Does the value of the DCGM_FI_PROF_PCIE_TX_BYTES metric include the data transmitted through PCIe network devices (e.g., InfiniBand cards) to GPUs on other hosts?
On some nodes dcgm-exporter prints this metric but its value is 0; on other nodes the metric is not printed at all.
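When debugging this kind of discrepancy it helps to distinguish "metric absent from the node's /metrics output" from "metric present with value 0". A small sketch of that check; the exposition strings are illustrative (in practice you would fetch each node's exporter endpoint, commonly port 9400):

```python
# Sketch: collect sample values for one metric family from Prometheus
# exposition text, so absence ([]) and a zero sample ([0.0]) are
# distinguishable. Example exposition text is illustrative only.

def metric_values(exposition: str, name: str) -> list[float]:
    """Return all sample values for the named metric family."""
    values = []
    for line in exposition.splitlines():
        # Match "name{labels} value" or "name value" lines.
        if line.startswith(name + "{") or line.startswith(name + " "):
            values.append(float(line.rsplit(None, 1)[-1]))
    return values

node_a = 'DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0"} 0\n'   # metric present, zero
node_b = 'DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41\n'        # metric missing

print(metric_values(node_a, "DCGM_FI_PROF_PCIE_TX_BYTES"))
print(metric_values(node_b, "DCGM_FI_PROF_PCIE_TX_BYTES"))
```

An empty result usually points at the profiling (DCP) metrics not being available on that node, while a present-but-zero sample means the field is collected but idle.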
### Is this a new feature, an improvement, or a change to existing functionality? Improvement ### Please provide a clear description of the problem this feature solves when config changes...
### What is the version? 4.1.1-4.0.4 ### What happened? All of the metrics below are missing MPS process IDs in dcgm-exporter, although they are shown in nvidia-smi; tried...
### What is the version? 4.1.1-4.0.4 ### What happened? I started several Pods on the A100 using MPS and ran dcgm-exporter with env KUBERNETES_VIRTUAL_GPUS: true. Although each Pod can be bound...