dcgm-exporter
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
### What is the version? dcgm-exporter version: 3.3.8-3.6.0 DCGM version: 3.3.8 Driver version: 550.127.08 CUDA version: 12.4 ### What happened? When an XID error occurs on the GPU, the **xid_errors**...
### What is the version? nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubi9 ### What happened? nvcr.io/nvidia/k8s/dcgm-exporter:4.1.1-4.0.4-ubi9 ships a vulnerable **kubelet** version ### What did you expect to happen? please update the kubelet pkg to its latest patched version...
### Ask your question I was able to build the binary locally without the CUDA library. Of course, you would need dependencies like these: https://github.com/NVIDIA/dcgm-exporter/blob/56e30d623a023b12e9a2c14367c3b11a8b75693c/docker/Dockerfile.ubuntu#L73-L74 But then what do you need...
### What is the version? 4.1.1-4.0.4 ### What happened? When I enable MIG on the GPU host, I noticed that some metrics are converted incorrectly and are incompatible with...
Solves issue: https://github.com/NVIDIA/dcgm-exporter/issues/488 When I enable MIG on the GPU host, I noticed that some metrics are converted incorrectly and are incompatible with Prometheus. Examples include: DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR DCGM_FI_DEV_MIG_MODE DCGM_FI_DEV_SM_CLOCK
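One way to see why such fields break scraping: a Prometheus sample value must parse as a float, so any DCGM field exported with a string payload (e.g. a MIG device identifier) is invalid exposition. A minimal sketch that flags such lines; the sample text below is illustrative, not actual exporter output, and this is a simplified check (it does not handle spaces inside quoted label values):

```python
# Sketch: flag Prometheus exposition lines whose trailing sample value
# is not a parseable float. String-valued DCGM fields cannot be
# represented as Prometheus samples. Example lines are illustrative only.

def invalid_samples(exposition: str) -> list[str]:
    """Return exposition lines whose last token is not a float."""
    bad = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        value = line.rsplit(None, 1)[-1]  # last whitespace-separated token
        try:
            float(value)
        except ValueError:
            bad.append(line)
    return bad

sample = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
DCGM_FI_DEV_SM_CLOCK{gpu="0"} 1410
DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR{gpu="0"} MIG-abc123
"""

print(invalid_samples(sample))
```

Running this flags only the string-valued line, which is the shape of breakage reported above.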
### Ask your question Does the value of the DCGM_FI_PROF_PCIE_TX_BYTES metric include the data transmitted through PCIe network devices (e.g., InfiniBand cards) to GPUs on other hosts?
On some nodes dcgm-exporter prints this metric but its value is 0; on other nodes the metric is not printed at all.
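When debugging this kind of discrepancy it helps to distinguish "metric absent from the node's /metrics output" from "metric present with value 0". A small sketch of that check; the exposition strings are illustrative (in practice you would fetch each node's exporter endpoint, commonly port 9400):

```python
# Sketch: collect sample values for one metric family from Prometheus
# exposition text, so absence ([]) and a zero sample ([0.0]) are
# distinguishable. Example exposition text is illustrative only.

def metric_values(exposition: str, name: str) -> list[float]:
    """Return all sample values for the named metric family."""
    values = []
    for line in exposition.splitlines():
        # Match "name{labels} value" or "name value" lines.
        if line.startswith(name + "{") or line.startswith(name + " "):
            values.append(float(line.rsplit(None, 1)[-1]))
    return values

node_a = 'DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0"} 0\n'   # metric present, zero
node_b = 'DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41\n'        # metric missing

print(metric_values(node_a, "DCGM_FI_PROF_PCIE_TX_BYTES"))
print(metric_values(node_b, "DCGM_FI_PROF_PCIE_TX_BYTES"))
```

An empty result usually points at the profiling (DCP) metrics not being available on that node, while a present-but-zero sample means the field is collected but idle.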
### Is this a new feature, an improvement, or a change to existing functionality? Improvement ### Please provide a clear description of the problem this feature solves when config changes...
### What is the version? 4.1.1-4.0.4 ### What happened? All of the metrics below are missing MPS process IDs in dcgm-exporter, although they are shown in nvidia-smi; tried...
### What is the version? 4.1.1-4.0.4 ### What happened? I started several Pods on the A100 using MPS and ran dcgm-exporter with env KUBERNETES_VIRTUAL_GPUS: true. Although each Pod can be bound...