
Cannot collect GPU utilization metric for some pods when MIG is enabled

Open · melikeiremguler opened this issue 1 year ago · 2 comments

What is the version?

3.1.8-3.1.5-ubuntu20.04

What happened?

We have been using the GPU Operator in our Kubernetes cluster. GPU Operator helm-chart version: gpu-operator-v23.6.1. Kubernetes version: v1.26.6.

I enabled MIG on one node; you can see the node labels below. I also deployed a test app, whose YAML is included below. When I port-forwarded the dcgm-exporter pod on k8s-node-worker-2, I could see that only 5 pods' DCGM_FI_PROF_GR_ENGINE_ACTIVE metrics were available.

kubectl port-forward pod/nvidia-dcgm-exporter-qttj5 9400:9400
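
With the port-forward running, filtering the scrape output in a second terminal shows which pods actually export the metric (the anchor skips the # HELP and # TYPE lines):

curl -s http://localhost:9400/metrics | grep '^DCGM_FI_PROF_GR_ENGINE_ACTIVE'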

Some pods have no metrics, but when I exec into them I can see GPU usage. Also, this problem does not occur with the A100-80GB card.

 kubectl exec -it gpu-test-59cd4d464-jdk46 -- bash
root@gpu-test-59cd4d464-jdk46:/# nvidia-smi

The Node Labels

{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVXVNNIINT8": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.GFNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.X87": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
  "feature.node.kubernetes.io/cpu-model.family": "6",
  "feature.node.kubernetes.io/cpu-model.id": "106",
  "feature.node.kubernetes.io/cpu-model.vendor_id": "Intel",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.15.0-94-generic",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "15",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/pci-10de.present": "true",
  "feature.node.kubernetes.io/pci-1af4.present": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "ubuntu",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "20.04",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "20",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "04",
  "k8slens-edit-resource-version": "v1",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "k8s-node-worker-2",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/gpu-operator": "",
  "nvidia.com/cuda.driver.major": "535",
  "nvidia.com/cuda.driver.minor": "104",
  "nvidia.com/cuda.driver.rev": "05",
  "nvidia.com/cuda.runtime.major": "12",
  "nvidia.com/cuda.runtime.minor": "2",
  "nvidia.com/gfd.timestamp": "1727789534",
  "nvidia.com/gpu-driver-upgrade-state": "upgrade-done",
  "nvidia.com/gpu.compute.major": "9",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "7",
  "nvidia.com/gpu.deploy.container-toolkit": "true",
  "nvidia.com/gpu.deploy.dcgm": "true",
  "nvidia.com/gpu.deploy.dcgm-exporter": "true",
  "nvidia.com/gpu.deploy.device-plugin": "true",
  "nvidia.com/gpu.deploy.driver": "true",
  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
  "nvidia.com/gpu.deploy.mig-manager": "true",
  "nvidia.com/gpu.deploy.node-status-exporter": "true",
  "nvidia.com/gpu.deploy.nvsm": "true",
  "nvidia.com/gpu.deploy.operator-validator": "true",
  "nvidia.com/gpu.engines.copy": "1",
  "nvidia.com/gpu.engines.decoder": "1",
  "nvidia.com/gpu.engines.encoder": "0",
  "nvidia.com/gpu.engines.jpeg": "1",
  "nvidia.com/gpu.engines.ofa": "0",
  "nvidia.com/gpu.family": "hopper",
  "nvidia.com/gpu.machine": "HPC",
  "nvidia.com/gpu.memory": "11008",
  "nvidia.com/gpu.multiprocessors": "16",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "NVIDIA-H100-NVL-MIG-1g.12gb",
  "nvidia.com/gpu.replicas": "1",
  "nvidia.com/gpu.slices.ci": "1",
  "nvidia.com/gpu.slices.gi": "1",
  "nvidia.com/mig.capable": "true",
  "nvidia.com/mig.config": "all-1g.12gb",
  "nvidia.com/mig.config.state": "success",
  "nvidia.com/mig.strategy": "single"
}

Test App

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 7
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      hostPID: true
      containers:
        - name: cuda-sample-vector-add
          image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
          command: ["/bin/bash", "-c", "--"]
          args:
            - while true; do /cuda-samples/vectorAdd; done
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        kubernetes.io/hostname: k8s-node-worker-2
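
For reference, this is how I deploy it and check which MIG device each replica gets (gpu-test.yaml is whatever the manifest above is saved as; nvidia-smi -L lists the MIG devices visible to the container):

kubectl apply -f gpu-test.yaml
kubectl get pods -l app=gpu-test -o wide
kubectl exec -it gpu-test-59cd4d464-jdk46 -- nvidia-smi -L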

Port-forward Metric Output

# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="8",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-8mg7j"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="10",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-zlbl2"} 0.003227
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="11",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-pc27w"} 0.003653
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="12",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-gqzxm"} 0.003896
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="13",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-lt4fj"} 0.003856
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).

Usage of a Pod With No Metrics

root@gpu-test-59cd4d464-jdk46:/# nvidia-smi
Tue Oct  8 10:33:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000000:00:06.0 Off |                   On |
| N/A   70C    P0             127W / 400W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    7   0   0  |              20MiB / 11008MiB  | 16      0 |  1   0    1    0    1 |
|                  |               2MiB /     7MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0    2011028      C   /cuda-samples/vectorAdd                      10MiB |
+---------------------------------------------------------------------------------------+
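
Note that this pod is running on GPU instance ID 7, which does not appear among the GPU_I_ID values (8, 10, 11, 12, 13) in the metric output above, even though a process is clearly active on it.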

What did you expect to happen?

I should see the metric for all pods.

What is the GPU model?

H100 NVL

What is the environment?

DCGM-Exporter running on the pod

How did you deploy the dcgm-exporter and what is the configuration?

I use the GPU Operator.
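
For completeness, my understanding is that the operator can point dcgm-exporter at a custom metrics list through a ConfigMap referenced by the dcgmExporter.config.name chart value; a sketch, with an illustrative ConfigMap name:

kubectl create configmap metrics-config -n gpu-operator \
  --from-file=dcgm-metrics.csv=./dcgm-metrics.csv
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values --set dcgmExporter.config.name=metrics-config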

How to reproduce the issue?

No response

Anything else we need to know?

No response

melikeiremguler · Oct 08 '24 11:10

I hit the same issue on an A100 PCIe with MIG. I can also find DCGM_FI_DEV_GPU_UTIL in /etc/dcgm-exporter/dcp-metrics-included.csv, like below:

# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
....

Yet there is still no GPU_UTIL metric exported by dcgm-exporter.

dcgm-exporter version

dcgm-exporter:3.3.8-3.6.0-ubuntu22.04

Natelu · Nov 05 '24 02:11

Hello, I found https://github.com/NVIDIA/DCGM/issues/80 with this reply:

there are no plans to support DCGM_FI_DEV_GPU_UTIL for MIG instances. This metric is outdated and has several limitations. However, the new hardware now supports the same method as DCGM_FI_PROF_* metrics
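
If I read this right, the takeaway for MIG setups is to track DCGM_FI_PROF_GR_ENGINE_ACTIVE instead of DCGM_FI_DEV_GPU_UTIL. A sketch of the swap in a custom metrics CSV (file name hypothetical; the description matches the HELP text in the scrape output above):

cat >> dcgm-metrics.csv <<'EOF'
# replaces the deprecated DCGM_FI_DEV_GPU_UTIL on MIG-enabled GPUs
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
EOF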

P-Light · Feb 19 '25 16:02