Incorrect GPU profile reported by DCGM exporter

Open lukeelten opened this issue 11 months ago • 0 comments

DCGM Exporter is reporting an incorrect GPU profile.

We configured an A100 80GB with the MIG configuration all-7g.80gb, which creates a single MIG device that takes up the whole card. The MIG device is created properly and works as expected. The node shows the proper device as allocatable.

However, DCGM Exporter reports the device profile as 7g.79gb instead of 7g.80gb, which breaks some Grafana dashboards.

Installed operator version: 23.9.2
Driver version: 535.154.05
Kubernetes version: OpenShift 4.12.50 / kubernetes v1.25.16
Kernel: 4.18.0-372.91.1.el8_6.x86_64
DCGM exporter image: nvcr.io/nvidia/k8s/dcgm-exporter@sha256:011fb450af3fa2e8fe5d28d590e4c653631447bc23d149591ced3d89089c4f2c

DCGM metrics (excerpt):

sh-5.1# curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1410
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# TYPE DCGM_FI_DEV_MAX_SM_CLOCK gauge
DCGM_FI_DEV_MAX_SM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1410
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1512
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# TYPE DCGM_FI_DEV_MAX_MEM_CLOCK gauge
DCGM_FI_DEV_MAX_MEM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1512
....
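
To check which profile the exporter reports without reading the raw output by eye, the GPU_I_PROFILE label can be pulled out of the metrics text and compared with the expected profile. The following is a minimal sketch, not part of DCGM Exporter: the helper names `reported_profiles` and `unexpected_profiles` are hypothetical, the sample line is copied from the excerpt above, and the expected value 7g.80gb is taken from the configured MIG profile.

```python
import re

# One sample line copied from the metrics excerpt above.
SAMPLE = (
    'DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",'
    'device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",'
    'GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",'
    'container="",namespace="",pod=""} 1410'
)

# Matches the GPU_I_PROFILE label and captures its value.
PROFILE_RE = re.compile(r'GPU_I_PROFILE="([^"]*)"')


def reported_profiles(metrics_text):
    """Return the distinct GPU_I_PROFILE values found in Prometheus text output."""
    profiles = set()
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = PROFILE_RE.search(line)
        if match:
            profiles.add(match.group(1))
    return profiles


def unexpected_profiles(metrics_text, expected):
    """Return reported profiles that differ from the expected profile name."""
    return reported_profiles(metrics_text) - {expected}


if __name__ == "__main__":
    print(unexpected_profiles(SAMPLE, "7g.80gb"))
```

In practice the text passed in would be the full response from localhost:9400/metrics rather than a single sample line.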

nvidia-smi Output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:21:00.0 Off |                   On |
| N/A   37C    P0              77W / 300W |      6MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    0   0   0  |               6MiB / 81050MiB  | 98      0 |  7   0    5    1    1 |
|                  |               3MiB / 131072MiB |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
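
One possible source of the mismatch is how the memory size is rounded when the profile name is built. This is an assumption and has not been checked against the DCGM source. The MIG device above reports 81050MiB, which is about 79.15 GiB, so rounding down gives 79 while rounding up gives 80. The sketch below only illustrates that arithmetic; `profile_name` is a hypothetical helper, not a DCGM or NVML function.

```python
import math

MIB_PER_GIB = 1024


def profile_name(slices, memory_mib, rounding):
    """Build a MIG-style profile name such as '7g.80gb' from a size in MiB.

    `rounding` is the function used to turn the fractional GiB value into
    an integer (for example math.floor or math.ceil).
    """
    return f"{slices}g.{rounding(memory_mib / MIB_PER_GIB)}gb"


# MIG device memory taken from the nvidia-smi output above.
MIG_MEMORY_MIB = 81050

if __name__ == "__main__":
    print(profile_name(7, MIG_MEMORY_MIB, math.floor))  # rounds 79.15 down
    print(profile_name(7, MIG_MEMORY_MIB, math.ceil))   # rounds 79.15 up
```

If the exporter does derive the name this way, the label would follow the measured memory size rather than the canonical profile name used by the device plugin and the node labels.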

Cluster Policy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  vgpuDeviceManager:
    enabled: true
  migManager:
    enabled: true
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 10m
        memory: 100Mi
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
    resources:
      limits:
        memory: 512Mi
      requests:
        cpu: 50m
        memory: 256Mi
  gfd:
    enabled: true
    resources:
      limits:
        memory: 512Mi
      requests:
        cpu: 10m
        memory: 160Mi
  dcgmExporter:
    config:
      name: dcgm-exporter-config
    enabled: true
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 512Mi
    serviceMonitor:
      enabled: true
  driver:
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    enabled: true
    resources:
      limits:
        memory: 6Gi
      requests:
        cpu: 10m
        memory: 2Gi
    certConfig:
      name: ''
    repository: our-registry.example.com
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: true
        enable: true
        force: false
        timeoutSeconds: 900
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: true
        force: false
        timeoutSeconds: 900
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    version: 535.154.05-6
    virtualTopology:
      config: ''
    image: nvidia/driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 10m
        memory: 64Mi
  mig:
    strategy: mixed
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
    resources:
      limits:
        memory: 128Mi
      requests:
        cpu: 1m
        memory: 32Mi
  nodeStatusExporter:
    enabled: true
    resources:
      limits:
        memory: 512Mi
      requests:
        cpu: 15m
        memory: 128Mi
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  vgpuManager:
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
    resources:
      limits:
        memory: 128Mi
      requests:
        cpu: 10m
        memory: 32Mi
status:
  conditions:
    - lastTransitionTime: '2024-03-25T13:24:29Z'
      message: >-
        ClusterPolicy is ready as all resources have been successfully
        reconciled
      reason: Reconciled
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-03-25T13:24:29Z'
      message: ''
      reason: Ready
      status: 'False'
      type: Error
  namespace: nvidia-gpu-operator
  state: ready

Node Resource:

kind: Node
apiVersion: v1
metadata:
  name: metal-gpu-08
  labels:
    feature.node.kubernetes.io/kernel-version.full: 4.18.0-372.91.1.el8_6.x86_64
    feature.node.kubernetes.io/cpu-cpuid.IBSFFV: 'true'
    nvidia.com/cuda.runtime.minor: '2'
    nvidia.com/mig-7g.80gb.memory: '80640'
    feature.node.kubernetes.io/cpu-rdt.RDTL3CA: 'true'
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: '12'
    feature.node.kubernetes.io/cpu-cpuid.VTE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVMDA: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: 'true'
    feature.node.kubernetes.io/pci-8086.sriov.capable: 'true'
    feature.node.kubernetes.io/cpu-cpuid.NRIPS: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVES: 'true'
    beta.kubernetes.io/os: linux
    feature.node.kubernetes.io/cpu-model.family: '25'
    feature.node.kubernetes.io/kernel-version.minor: '18'
    feature.node.kubernetes.io/cpu-cpuid.WBNOINVD: 'true'
    nvidia.com/gpu.machine: PowerEdge-R7525
    nvidia.com/gpu.memory: '81920'
    feature.node.kubernetes.io/cpu-cpuid.AESNI: 'true'
    nvidia.com/mig-7g.80gb.engines.jpeg: '1'
    nvidia.com/cuda.runtime.major: '12'
    nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    kubernetes.io/os: linux
    feature.node.kubernetes.io/cpu-cpuid.SVM: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SME: 'true'
    nvidia.com/gpu.deploy.device-plugin: 'true'
    nvidia.com/mig.config: all-7g.80gb
    nvidia.com/gpu.deploy.nvsm: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP: 'true'
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: '4'
    feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT: 'true'
    nvidia.com/gpu.deploy.mig-manager: 'true'
    nvidia.com/gpu.family: ampere
    feature.node.kubernetes.io/cpu-cpuid.SHA: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVML: 'true'
    nvidia.com/mig-7g.80gb.engines.decoder: '5'
    node-role.kubernetes.io/baremetal: ''
    feature.node.kubernetes.io/cpu-cpuid.SVMNP: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW: 'true'
    nvidia.com/mig.capable: 'true'
    feature.node.kubernetes.io/kernel-version.major: '4'
    feature.node.kubernetes.io/pci-102b.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true'
    nvidia.com/mig.config.state: success
    feature.node.kubernetes.io/cpu-cpuid.SUCCOR: 'true'
    feature.node.kubernetes.io/cpu-cpuid.INVLPGB: 'true'
    feature.node.kubernetes.io/pci-14e4.present: 'true'
    feature.node.kubernetes.io/kernel-version.revision: '0'
    feature.node.kubernetes.io/storage-nonrotationaldisk: 'true'
    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: 'true'
    node-role.kubernetes.io/worker: ''
    nvidia.com/gpu.count: '1'
    feature.node.kubernetes.io/cpu-cpuid.LBRVIRT: 'true'
    nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
    nvidia.com/gfd.timestamp: '1710459611'
    nvidia.com/gpu.deploy.driver: 'true'
    nvidia.com/cuda.driver.minor: '154'
    feature.node.kubernetes.io/cpu-cpuid.CETSS: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE: 'true'
    nvidia.com/mig-7g.80gb.engines.ofa: '1'
    feature.node.kubernetes.io/cpu-cpuid.SSE4A: 'true'
    feature.node.kubernetes.io/pci-10de.sriov.capable: 'true'
    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: 'true'
    feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR: 'true'
    nvidia.com/gpu.deploy.operator-validator: 'true'
    feature.node.kubernetes.io/cpu-rdt.RDTCMT: 'true'
    feature.node.kubernetes.io/iommu-enabled: 'true'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVMFBASID: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_ES: 'true'
    nvidia.com/mig-7g.80gb.slices.gi: '7'
    feature.node.kubernetes.io/system-os_release.RHEL_VERSION: '8.6'
    nvidia.com/mig-7g.80gb.product: NVIDIA-A100-80GB-PCIe-MIG-7g.80gb
    nvidia.com/mig-7g.80gb.count: '1'
    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT: 'true'
    node.openshift.io/os_id: rhcos
    feature.node.kubernetes.io/pci-8086.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SCE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.ADX: 'true'
    feature.node.kubernetes.io/cpu-rdt.RDTMON: 'true'
    nvidia.com/cuda.driver.major: '535'
    feature.node.kubernetes.io/cpu-cpuid.CPBOOST: 'true'
    feature.node.kubernetes.io/memory-numa: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MOVBE: 'true'
    nvidia.com/mig-7g.80gb.engines.copy: '7'
    feature.node.kubernetes.io/cpu-cpuid.X87: 'true'
    nvidia.com/mig-7g.80gb.slices.ci: '7'
    feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVEC: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MSRIRC: 'true'
    feature.node.kubernetes.io/cpu-model.vendor_id: AMD
    feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK: 'true'
    feature.node.kubernetes.io/cpu-cpuid.AVX2: 'true'
    kubernetes.io/hostname: lin-crete-metal-gpu-08
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    feature.node.kubernetes.io/system-os_release.ID: rhcos
    feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT: 'true'
    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 412.86.202402131748-0
    beta.kubernetes.io/arch: amd64
    nvidia.com/gpu.deploy.node-status-exporter: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_SNP: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBS: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FXSR: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST: 'true'
    feature.node.kubernetes.io/cpu-cpuid.CLZERO: 'true'
    kubernetes.io/arch: amd64
    nvidia.com/mig.strategy: mixed
    feature.node.kubernetes.io/cpu-cpuid.VMPL: 'true'
    feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION: '4.12'
    nvidia.com/gpu.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.LAHF: 'true'
    feature.node.kubernetes.io/pci-10de.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVMPFT: 'true'
    nvidia.com/mig-7g.80gb.multiprocessors: '98'
    feature.node.kubernetes.io/cpu-model.id: '1'
    feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XGETBV1: 'true'
    feature.node.kubernetes.io/kernel-selinux.enabled: 'true'
    feature.node.kubernetes.io/system-os_release.VERSION_ID: '4.12'
    nvidia.com/mig-7g.80gb.replicas: '1'
    feature.node.kubernetes.io/network-sriov.capable: 'true'
    nvidia.com/cuda.driver.rev: '05'
    feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED: 'true'
    nvidia.com/gpu.compute.minor: '0'
    feature.node.kubernetes.io/cpu-cpuid.SVMPF: 'true'
    feature.node.kubernetes.io/cpu-rdt.RDTMBM: 'true'
    feature.node.kubernetes.io/cpu-cpuid.RDPRU: 'true'
    feature.node.kubernetes.io/cpu-hardware_multithreading: 'true'
    portworx.io/nobackup: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FXSROPT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FMA3: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VAES: 'true'
    feature.node.kubernetes.io/cpu-cpuid.AVX: 'true'
    nvidia.com/gpu.replicas: '0'
    feature.node.kubernetes.io/kernel-config.NO_HZ: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV: 'true'
    nvidia.com/gpu.compute.major: '8'
    nvidia.com/mig-7g.80gb.engines.encoder: '0'
    feature.node.kubernetes.io/cpu-cpuid.MCOMMIT: 'true'
status:
  capacity:
    cpu: '128'
    ephemeral-storage: 233829932Ki
    memory: 792179236Ki
    nvidia.com/mig-1g.10gb: '0'
    nvidia.com/mig-7g.80gb: '1'
  allocatable:
    cpu: '124'
    ephemeral-storage: '215497664975'
    memory: 760721956Ki
    nvidia.com/mig-1g.10gb: '0'
    nvidia.com/mig-7g.80gb: '1'

lukeelten, Mar 26 '24 09:03