dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

Results 37 dcgm-exporter issues

### What is the version? 3.3.5 ### What happened? GPU: A30. GPU driver: 470.103.01. When I added DCGM_EXP_CLOCK_EVENTS_COUNT to collect the data from A30 MIG 6G, it failed and showed...

bug

Currently it doesn't seem that container/pod/namespace information is emitted from dcgm-exporter when MIG is enabled on the GPU. This is important when we need to aggregate GPU utilization across containers/cgroups....

bug

### Ask your question Hi, I am hoping to understand the difference between the `dcgmi -v` version and the version of `dcgm exporter` which should be used. I want to...

question

### What is the version? 3.3.3-3.3.0, 3.3.5-3.4.1 ### What happened? I cannot see the profiling metrics listed in the documentation, [NVLink Bandwidth](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#profiling-metrics), PROF_NVLINK_TX_BYTES, PROF_NVLINK_RX_BYTES, but I can see for...

bug

The DCGM_FI_DEV_XID_ERRORS metric reports the XID error code value. This commit includes an err_msg label whose value is taken from this NVIDIA doc: https://docs.nvidia.com/deploy/xid-errors/#topic_4

I'm not using K8s but want to collect the container name as part of the metrics. Each job is run in a container and the container name matches the jobid we want...

enhancement

### What is the version? 3.3.5-3.4.1 ### What happened? Upgraded from k8s 1.24 to 1.25 and dcgm-exporter from 3.3.3-3.3.1 to 3.3.5-3.4.1. The dcgm-exporter pod is now in CrashLoopBackOff with this...

bug

### What happened? Unable to collect GPU metrics for relevant pods when using passthrough mode. For example, dcgm-exporter does not collect metrics when a VM created with kubevirt mounts a...

enhancement

### What is the version? 3.3.5-3.4.0 ### What happened? After upgrading from 3.1.3-3.1.2 to 3.3.5-3.4.0, the GPU temperature metric DCGM_FI_DEV_GPU_TEMP occasionally reports extremely large numbers, e.g. 345, 82505, 200000, 644245923. It...

bug

### What is the version? 3.3.5-3.4.1-ubuntu22.04 ### What happened? I'm using the latest dcgm-exporter, 3.3.5-3.4.1-ubuntu22.04, and expose the metric DCGM_EXP_XID_ERRORS_COUNT by overriding the env variable DCGM_EXPORTER_COLLECTORS. I can see this in log...

bug