dcgm-exporter
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Hi, I'm working on deploying dcgm exporter to several clusters that I operate. I noticed that the DaemonSet requests root privileges and I would rather that it didn't (https://github.com/NVIDIA/dcgm-exporter/blob/main/dcgm-exporter.yaml#L47). I...
Hi! I have a memory leak in the exporter. Dcgm-exporter: 3.3.5-3.4.0; Model: NVIDIA A30; Driver Version: 550.54.14; CUDA Version: 12.4
The [DCGM_FI_DEV_COUNT](https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html?highlight=dev_count#c.DCGM_FI_DEV_COUNT) metric is exposed as a counter. Here's an example response:
```
# HELP DCGM_FI_DEV_COUNT Number of Devices on the node.
# TYPE DCGM_FI_DEV_COUNT counter
DCGM_FI_DEV_COUNT{gpu="0",UUID="GPU-8afe0f31-4207-33ec-7e08-af8774375fee",device="nvidia0",modelName="NVIDIA H100 PCIe",Hostname="iscxh001.mskcc.org",DCGM_FI_CUDA_DRIVER_VERSION="12020",DCGM_FI_DEV_BRAND="NVIDIA",DCGM_FI_DEV_MINOR_NUMBER="0",DCGM_FI_DEV_NAME="NVIDIA H100 PCIe",DCGM_FI_DEV_SERIAL="1650723017032",DCGM_FI_DRIVER_VERSION="535.104.12",DCGM_FI_PROCESS_NAME="/usr/local/sbin/dcgm-exporter"}...
```
While setting up DCGM exporter I am running into an issue; it looks like something is conflicting. I am not sure whether this is on the code side or not. A long time ago I have...
Since the exporter uses the default CSV file by default, the Helm chart shouldn't create an unused ConfigMap. I am not a Helm expert but I feel this [file](https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/templates/metrics-configmap.yaml) will...
Hi there, I'm curious why the GPU drain state like the following is not included in the dcgm exporter:
```
Linux:~$ sudo nvidia-smi drain -p 0000:3f:00.0 -q
The current drain...
```
I have a pod in status Completed that used a GPU card. The view from `kubectl describe node gpu-178` differs from what the exporter reports. Obviously, dcgm exporter has included the cards of the...
Hi, I've installed everything and it is working well, but I realized that even with DCGM_FI_DEV_GPU_UTIL allowed in the map documents, this metric is not showing in Prometheus and Grafana....
Hi, DCGM team. I am using the DCGM tool to profile my GPU job. The result is shown below: [image] The SM occupancy is defined as "The ratio of number...
Running a [3.3.5-3.4.0 exporter](https://github.com/NVIDIA/dcgm-exporter/releases/tag/3.3.5-3.4.0) on a 3.3.5 host engine as shipped via nvidia-ubuntu-repos segfaults the host engine. Recorded here for completeness; reported to DCGM in the first place: https://github.com/NVIDIA/DCGM/issues/155
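For the DCGM_FI_DEV_COUNT report above, one way to check which type the exporter declares for a metric is to read the `# TYPE` lines of the scraped text. The following is a minimal sketch, not part of dcgm-exporter: the helper name `declared_types` is hypothetical, and the sample text is just the two comment lines quoted in that report. In Prometheus a counter is expected to only increase, which is presumably why a device count declared as a counter was raised as an issue; that reading is an assumption.

```python
import re

# Matches Prometheus text-format type declarations such as
# "# TYPE DCGM_FI_DEV_COUNT counter".
TYPE_LINE = re.compile(r"^# TYPE (\S+) (\S+)$")


def declared_types(exposition_text):
    """Map each metric name to the type declared in its '# TYPE' line.

    Hypothetical helper for illustration; it only reads the declarations
    and does not validate the sample lines themselves.
    """
    types = {}
    for line in exposition_text.splitlines():
        match = TYPE_LINE.match(line.strip())
        if match:
            types[match.group(1)] = match.group(2)
    return types


# The two comment lines quoted in the issue above.
sample = """\
# HELP DCGM_FI_DEV_COUNT Number of Devices on the node.
# TYPE DCGM_FI_DEV_COUNT counter
"""

if __name__ == "__main__":
    for name, metric_type in declared_types(sample).items():
        print(f"{name} is declared as {metric_type}")
```

The same function could be pointed at the full text returned by the exporter's metrics endpoint to list every metric declared as a counter.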