
Cannot Retrieve GPU PIDs from DCGM Metrics

Open doronkg opened this issue 8 months ago • 0 comments

Ask your question

Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to write a PromQL aggregation that correlates GPU process IDs (PIDs) with K8s Pods.

None of the exported DCGM metrics has a label representing the GPU PID. The DCGM release notes mention the following:

The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
...

My question: is there a way to retrieve this information in the current version? I originally submitted this issue to the DCGM Exporter GitHub repo.

The workaround I've implemented for the time being is a custom DaemonSet on all GPU nodes. It runs the following command to correlate each GPU PID with its Pod UID, and the result is exposed as a custom metric:

$ nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -I{} sh -c "echo -n '{}'; echo -n ','; grep -oPm1 '[0-9a-f]{8}(_[0-9a-f]{4}){3}_[0-9a-f]{12}' /proc/{}/cgroup | sed 's/_/-/g'" 
114855,c8b8d8a2-5e73-4c1a-b8e3-735e8a4e56d3
115044,1f7d9c8e-4a4b-455b-9b0d-9a2d1f4e6c2f
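
The regex in the command above matches underscore-separated UIDs, which is how Pod UIDs appear in cgroup paths under the systemd cgroup driver. A minimal sketch of a slightly more tolerant extraction, assuming that nodes using the cgroupfs driver write the UID with dashes instead; the function name pod_uid_from_cgroup is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the Pod UID found in a cgroup file.
# $1 is the path to a cgroup file, e.g. /proc/<pid>/cgroup.
# Accepts both underscore-separated UIDs (systemd cgroup driver) and
# dash-separated UIDs (assumed cgroupfs driver layout).
pod_uid_from_cgroup() {
  grep -oE -m1 '[0-9a-f]{8}([_-][0-9a-f]{4}){3}[_-][0-9a-f]{12}' "$1" \
    | head -n1 \
    | tr '_' '-'   # normalize to the canonical dashed UID form
}

# Usage (requires access to the host's /proc):
#   pod_uid_from_cgroup /proc/114855/cgroup
```

The function prints nothing when the process does not belong to a Pod, so callers should check for an empty result.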

NOTE: The DaemonSet requires setting hostPID: true in the Pod spec, so that the host's PIDs are visible under /proc inside the container.
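
To show how the pid,uid pairs could become a custom metric, here is a rough sketch that converts them into the Prometheus text exposition format. The metric name gpu_process_info and the label names pid and pod_uid are hypothetical, and the input is assumed to hold one complete pair per line:

```shell
#!/bin/sh
# Convert "pid,pod_uid" lines on stdin into Prometheus text exposition format.
# NOTE: the metric name gpu_process_info and its labels are hypothetical.
to_prom_metrics() {
  echo '# HELP gpu_process_info GPU compute process mapped to its Pod UID.'
  echo '# TYPE gpu_process_info gauge'
  while IFS=, read -r pid uid; do
    # Skip incomplete pairs, e.g. GPU processes that do not belong to a Pod.
    [ -n "$pid" ] && [ -n "$uid" ] || continue
    printf 'gpu_process_info{pid="%s",pod_uid="%s"} 1\n' "$pid" "$uid"
  done
}

# Example using one of the sample pairs shown above:
printf '%s\n' '114855,c8b8d8a2-5e73-4c1a-b8e3-735e8a4e56d3' | to_prom_metrics
```

If kube-state-metrics is deployed, the pod_uid label could then probably be matched against the uid label of kube_pod_info to obtain the Pod name and namespace, though the join would need a label_replace since the label names differ.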

Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7

doronkg, Jun 25 '24 16:06