Cannot Retrieve GPU PIDs from DCGM Metrics
Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to write a PromQL aggregation that correlates GPU process IDs (PIDs) with Kubernetes Pods.
None of the exported DCGM metrics has a label representing the GPU PID. The DCGM release notes mention the following:
The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
...
My question: is there a way to retrieve this information in the current version? I originally submitted this issue to the DCGM Exporter GitHub repo.
As a workaround for now, I run a custom DaemonSet on all GPU nodes. It executes the following command to map each GPU PID to its Pod UID, and I export the result as a custom metric:
$ nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -I{} sh -c "echo -n '{}'; echo -n ','; grep -oPm1 '[0-9a-f]{8}(_[0-9a-f]{4}){3}_[0-9a-f]{12}' /proc/{}/cgroup | sed 's/_/-/g'"
114855,c8b8d8a2-5e73-4c1a-b8e3-735e8a4e56d3
115044,1f7d9c8e-4a4b-455b-9b0d-9a2d1f4e6c2f
NOTE: This requires setting hostPID: true in the Pod spec.
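For anyone who wants to turn the one-liner into something scrapeable, here is a minimal sketch of the same idea split into small shell functions that print Prometheus text-format samples. The metric name gpu_process_info and the function names are my own placeholders, not anything DCGM or the exporter provides. The regex also accepts dash-separated UIDs, on the assumption that nodes using the cgroupfs driver write the Pod UID that way instead of with underscores.

```shell
#!/bin/sh
# Sketch only: map GPU compute PIDs to Pod UIDs and print Prometheus samples.
# gpu_process_info, pod_uid_from_cgroup, format_sample and emit_metrics are
# hypothetical names chosen for this example.

# Read cgroup text on stdin and print the first Pod UID found,
# normalizing "_" separators (systemd cgroup driver) to "-".
pod_uid_from_cgroup() {
  grep -oE -m1 '[0-9a-f]{8}([_-][0-9a-f]{4}){3}[_-][0-9a-f]{12}' \
    | head -n1 | tr '_' '-'
}

# Print one sample line. $1 = PID, $2 = Pod UID.
format_sample() {
  printf 'gpu_process_info{pid="%s",pod_uid="%s"} 1\n' "$1" "$2"
}

# Query nvidia-smi for compute PIDs and emit one sample per PID that
# belongs to a Pod. PIDs without a readable cgroup file are skipped.
emit_metrics() {
  nvidia-smi --query-compute-apps=pid --format=csv,noheader |
  while read -r pid; do
    [ -r "/proc/$pid/cgroup" ] || continue
    uid=$(pod_uid_from_cgroup < "/proc/$pid/cgroup")
    if [ -n "$uid" ]; then
      format_sample "$pid" "$uid"
    fi
  done
}
```

Calling emit_metrics inside the DaemonSet container prints the samples to stdout; how they reach Prometheus (a textfile collector, a small HTTP server, and so on) is left open here. The resulting pod_uid label could then be joined in PromQL against a metric that carries the Pod UID.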
Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7