Cannot collect GPU utilization metrics for some pods when MIG is enabled
What is the version?
3.1.8-3.1.5-ubuntu20.04
What happened?
We have been using the GPU Operator in a Kubernetes cluster. GPU Operator Helm chart version: gpu-operator-v23.6.1. Kubernetes version: v1.26.6.
I enabled MIG on one node (node labels below) and deployed a test app (manifest below). When I port-forwarded the dcgm-exporter pod on k8s-node-worker-2, the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric was available for only 5 of the 7 pods.
kubectl port-forward pod/nvidia-dcgm-exporter-qttj5 9400:9400
Some pods have no metrics at all, even though nvidia-smi inside those pods shows GPU usage. This problem does not occur with an A100-80GB card.
kubectl exec -it gpu-test-59cd4d464-jdk46 -- bash
root@gpu-test-59cd4d464-jdk46:/# nvidia-smi
The Node Labels
{
"beta.kubernetes.io/arch": "amd64",
"beta.kubernetes.io/os": "linux",
"feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVXVNNIINT8": "true",
"feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
"feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
"feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.GFNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR": "true",
"feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
"feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
"feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR": "true",
"feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
"feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
"feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
"feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
"feature.node.kubernetes.io/cpu-cpuid.X87": "true",
"feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
"feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
"feature.node.kubernetes.io/cpu-model.family": "6",
"feature.node.kubernetes.io/cpu-model.id": "106",
"feature.node.kubernetes.io/cpu-model.vendor_id": "Intel",
"feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
"feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE": "true",
"feature.node.kubernetes.io/kernel-version.full": "5.15.0-94-generic",
"feature.node.kubernetes.io/kernel-version.major": "5",
"feature.node.kubernetes.io/kernel-version.minor": "15",
"feature.node.kubernetes.io/kernel-version.revision": "0",
"feature.node.kubernetes.io/pci-10de.present": "true",
"feature.node.kubernetes.io/pci-1af4.present": "true",
"feature.node.kubernetes.io/system-os_release.ID": "ubuntu",
"feature.node.kubernetes.io/system-os_release.VERSION_ID": "20.04",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "20",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "04",
"k8slens-edit-resource-version": "v1",
"kubernetes.io/arch": "amd64",
"kubernetes.io/hostname": "k8s-node-worker-2",
"kubernetes.io/os": "linux",
"node-role.kubernetes.io/gpu-operator": "",
"nvidia.com/cuda.driver.major": "535",
"nvidia.com/cuda.driver.minor": "104",
"nvidia.com/cuda.driver.rev": "05",
"nvidia.com/cuda.runtime.major": "12",
"nvidia.com/cuda.runtime.minor": "2",
"nvidia.com/gfd.timestamp": "1727789534",
"nvidia.com/gpu-driver-upgrade-state": "upgrade-done",
"nvidia.com/gpu.compute.major": "9",
"nvidia.com/gpu.compute.minor": "0",
"nvidia.com/gpu.count": "7",
"nvidia.com/gpu.deploy.container-toolkit": "true",
"nvidia.com/gpu.deploy.dcgm": "true",
"nvidia.com/gpu.deploy.dcgm-exporter": "true",
"nvidia.com/gpu.deploy.device-plugin": "true",
"nvidia.com/gpu.deploy.driver": "true",
"nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
"nvidia.com/gpu.deploy.mig-manager": "true",
"nvidia.com/gpu.deploy.node-status-exporter": "true",
"nvidia.com/gpu.deploy.nvsm": "true",
"nvidia.com/gpu.deploy.operator-validator": "true",
"nvidia.com/gpu.engines.copy": "1",
"nvidia.com/gpu.engines.decoder": "1",
"nvidia.com/gpu.engines.encoder": "0",
"nvidia.com/gpu.engines.jpeg": "1",
"nvidia.com/gpu.engines.ofa": "0",
"nvidia.com/gpu.family": "hopper",
"nvidia.com/gpu.machine": "HPC",
"nvidia.com/gpu.memory": "11008",
"nvidia.com/gpu.multiprocessors": "16",
"nvidia.com/gpu.present": "true",
"nvidia.com/gpu.product": "NVIDIA-H100-NVL-MIG-1g.12gb",
"nvidia.com/gpu.replicas": "1",
"nvidia.com/gpu.slices.ci": "1",
"nvidia.com/gpu.slices.gi": "1",
"nvidia.com/mig.capable": "true",
"nvidia.com/mig.config": "all-1g.12gb",
"nvidia.com/mig.config.state": "success",
"nvidia.com/mig.strategy": "single"
}
Test App
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 7
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      hostPID: true
      containers:
      - name: cuda-sample-vector-add
        image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
        command: ["/bin/bash", "-c", "--"]
        args:
        - while true; do /cuda-samples/vectorAdd; done
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        kubernetes.io/hostname: k8s-node-worker-2
Port-forward Metric Output
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="8",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-8mg7j"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="10",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-zlbl2"} 0.003227
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="11",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-pc27w"} 0.003653
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="12",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-gqzxm"} 0.003896
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="13",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-lt4fj"} 0.003856
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
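To pin down exactly which replicas are missing from the scrape, the pod labels can be parsed out of the exporter output and diffed against the pods running on the node. A minimal sketch (pod names taken from the samples above; in practice the expected set would come from `kubectl get pods`, and the seventh replica's name is not shown in this report, so only the pod checked with nvidia-smi below is added):

```python
import re

# Pod labels copied from the DCGM_FI_PROF_GR_ENGINE_ACTIVE samples above.
metrics_output = """\
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="8",pod="gpu-test-59cd4d464-8mg7j"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="10",pod="gpu-test-59cd4d464-zlbl2"} 0.003227
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="11",pod="gpu-test-59cd4d464-pc27w"} 0.003653
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="12",pod="gpu-test-59cd4d464-gqzxm"} 0.003896
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="13",pod="gpu-test-59cd4d464-lt4fj"} 0.003856
"""

def reported_pods(text: str, metric: str) -> set[str]:
    """Collect the pod="" label values present for one metric name."""
    pattern = re.compile(r"^" + re.escape(metric) + r'\{[^}]*pod="([^"]+)"',
                         re.MULTILINE)
    return set(pattern.findall(text))

# Hypothetical expected set: the 5 reported replicas plus
# gpu-test-59cd4d464-jdk46, the pod checked with nvidia-smi in this report.
expected = reported_pods(metrics_output, "DCGM_FI_PROF_GR_ENGINE_ACTIVE")
expected.add("gpu-test-59cd4d464-jdk46")

missing = expected - reported_pods(metrics_output, "DCGM_FI_PROF_GR_ENGINE_ACTIVE")
print(sorted(missing))  # ['gpu-test-59cd4d464-jdk46']
```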
Usage of Pod With No Metric
root@gpu-test-59cd4d464-jdk46:/# nvidia-smi
Tue Oct 8 10:33:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 NVL On | 00000000:00:06.0 Off | On |
| N/A 70C P0 127W / 400W | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 7 0 0 | 20MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 2MiB / 7MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 7 0 2011028 C /cuda-samples/vectorAdd 10MiB |
+---------------------------------------------------------------------------------------+
What did you expect to happen?
I expected to see the metric for all 7 pods.
What is the GPU model?
h100-nvl
What is the environment?
DCGM-Exporter running on the pod
How did you deploy the dcgm-exporter and what is the configuration?
I use the GPU Operator.
How to reproduce the issue?
No response
Anything else we need to know?
No response
I hit the same issue on an A100 PCIe with MIG. I can also see DCGM_FI_DEV_GPU_UTIL listed in /etc/dcgm-exporter/dcp-metrics-included.csv, like below:
# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
....
and dcgm-exporter still does not emit any GPU_UTIL metric.
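Having the field in that CSV is necessary but evidently not sufficient: the exporter publishes only fields that appear uncommented in the file, yet DCGM itself can still decline to report a field for MIG instances. A sketch of how the CSV is read (the three-column `field, type, help` layout matches the snippet above; this simplified parser is an illustration, not the exporter's actual code):

```python
def enabled_fields(csv_text: str) -> list[str]:
    """List the DCGM field names enabled in a dcgm-exporter metrics CSV.

    Each non-comment line has the form: FIELD, prometheus-type, help text.
    Lines starting with '#' and blank lines are ignored.
    """
    fields = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields.append(line.split(",", 1)[0].strip())
    return fields

# Excerpt shaped like the dcp-metrics-included.csv snippet above.
sample = """\
# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active
"""

print(enabled_fields(sample))
# ['DCGM_FI_DEV_GPU_UTIL', 'DCGM_FI_PROF_GR_ENGINE_ACTIVE']
```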
dcgm-exporter version
dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
Hello, I found https://github.com/NVIDIA/DCGM/issues/80, which has this reply:
"There are no plans to support DCGM_FI_DEV_GPU_UTIL for MIG instances. This metric is outdated and has several limitations. However, the new hardware now supports the same method as the DCGM_FI_PROF_* metrics."
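Per that reply, DCGM_FI_PROF_GR_ENGINE_ACTIVE is the supported path on MIG. One practical note when migrating dashboards: despite the "(in %)" help text, the samples above (0.003…) suggest the PROF metric is exported as a 0-1 ratio, while DCGM_FI_DEV_GPU_UTIL reported 0-100 percentages, so a scale factor is likely needed. A trivial sketch (the function name is illustrative):

```python
def gr_engine_active_to_util_percent(ratio: float) -> float:
    """Scale DCGM_FI_PROF_GR_ENGINE_ACTIVE (a 0-1 ratio) to the 0-100
    range that DCGM_FI_DEV_GPU_UTIL dashboards and alerts expect."""
    return ratio * 100.0

# e.g. the GPU_I_ID="10" sample above: 0.003227 -> roughly 0.32% utilization
print(gr_engine_active_to_util_percent(0.003227))
```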