gpu-operator
Incorrect GPU profile reported by DCGM exporter
DCGM Exporter is reporting an incorrect GPU profile.
We configured an A100 80GB with the MIG profile all-7g.80gb, which creates a single MIG device that takes up the whole card.
The MIG device is created properly and works as expected, and the node shows the proper device as allocatable.
However, DCGM Exporter reports the device as 7g.79gb, which breaks some Grafana dashboards.
Installed operator version: 23.9.2
Driver Version: 535.154.05
Kubernetes version: OpenShift 4.12.50 / Kubernetes v1.25.16
Kernel: 4.18.0-372.91.1.el8_6.x86_64
DCGM exporter image: nvcr.io/nvidia/k8s/dcgm-exporter@sha256:011fb450af3fa2e8fe5d28d590e4c653631447bc23d149591ced3d89089c4f2c
DCGM metrics (excerpt):
sh-5.1# curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1410
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# TYPE DCGM_FI_DEV_MAX_SM_CLOCK gauge
DCGM_FI_DEV_MAX_SM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1410
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1512
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# TYPE DCGM_FI_DEV_MAX_MEM_CLOCK gauge
DCGM_FI_DEV_MAX_MEM_CLOCK{gpu="0",UUID="GPU-06357e72-c3ee-614b-775f-a89e18e8cc08",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0",Hostname="metal-gpu-08",DCGM_FI_DRIVER_VERSION="535.154.05",container="",namespace="",pod=""} 1512
....
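To check which profile the exporter reports without reading the raw output by eye, the label can be pulled out of the exposition text. This is a minimal sketch, assuming the metrics text has already been fetched (for example with the curl command above); the function name `reported_profiles` is hypothetical and not part of dcgm-exporter.

```python
import re

# Matches the GPU_I_PROFILE label inside a Prometheus exposition line.
_PROFILE_RE = re.compile(r'GPU_I_PROFILE="([^"]*)"')


def reported_profiles(metrics_text: str) -> set:
    """Return the distinct GPU_I_PROFILE label values found in the metrics text."""
    profiles = set()
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _PROFILE_RE.search(line)
        if match:
            profiles.add(match.group(1))
    return profiles


# Shortened sample line based on the excerpt above.
sample = 'DCGM_FI_DEV_SM_CLOCK{gpu="0",GPU_I_PROFILE="7g.79gb",GPU_I_ID="0"} 1410'
print(reported_profiles(sample))
```

On this node the set contains only 7g.79gb, although the configured profile is 7g.80gb.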
nvidia-smi Output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:21:00.0 Off |                   On |
| N/A   37C    P0              77W / 300W |      6MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    0   0   0  |                6MiB / 81050MiB | 98      0 |  7   0   5   1   1    |
|                  |               3MiB / 131072MiB |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
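One possible explanation for the 79, offered as an unconfirmed hypothesis and not as a reading of the DCGM source: the MIG device exposes 81050 MiB, which is about 79.15 GiB, so a label built by truncating MiB to whole GiB would say 79, while the MIG profile name uses the nominal 80. The sketch below only does that arithmetic; `label_from_mib` is a hypothetical helper, and the truncation is an assumption.

```python
def label_from_mib(slices: int, memory_mib: int) -> str:
    """Build a profile-style label by truncating MiB to whole GiB (assumed behaviour)."""
    return f"{slices}g.{memory_mib // 1024}gb"


# 81050 MiB is the MIG device memory shown by nvidia-smi above.
print(81050 / 1024)              # 79.150390625
print(label_from_mib(7, 81050))  # 7g.79gb
```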
Cluster Policy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  vgpuDeviceManager:
    enabled: true
  migManager:
    enabled: true
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 10m
        memory: 100Mi
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
    resources:
      limits:
        memory: 512Mi
      requests:
        cpu: 50m
        memory: 256Mi
  gfd:
    enabled: true
    resources:
      limits:
        memory: 512Mi
      requests:
        cpu: 10m
        memory: 160Mi
  dcgmExporter:
    config:
      name: dcgm-exporter-config
    enabled: true
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 512Mi
    serviceMonitor:
      enabled: true
  driver:
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    enabled: true
    resources:
      limits:
        memory: 6Gi
      requests:
        cpu: 10m
        memory: 2Gi
    certConfig:
      name: ''
    repository: our-registry.example.com
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: true
        enable: true
        force: false
        timeoutSeconds: 900
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: true
        force: false
        timeoutSeconds: 900
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    version: 535.154.05-6
    virtualTopology:
      config: ''
    image: nvidia/driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 10m
        memory: 64Mi
  mig:
    strategy: mixed
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
    resources:
      limits:
        memory: 128Mi
      requests:
        cpu: 1m
        memory: 32Mi
  nodeStatusExporter:
    enabled: true
    resources:
      limits:
        memory: 512Mi
      requests:
        cpu: 15m
        memory: 128Mi
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  vgpuManager:
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
    resources:
      limits:
        memory: 128Mi
      requests:
        cpu: 10m
        memory: 32Mi
status:
  conditions:
    - lastTransitionTime: '2024-03-25T13:24:29Z'
      message: >-
        ClusterPolicy is ready as all resources have been successfully
        reconciled
      reason: Reconciled
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-03-25T13:24:29Z'
      message: ''
      reason: Ready
      status: 'False'
      type: Error
  namespace: nvidia-gpu-operator
  state: ready
Node Resource:
kind: Node
apiVersion: v1
metadata:
  name: metal-gpu-08
  labels:
    feature.node.kubernetes.io/kernel-version.full: 4.18.0-372.91.1.el8_6.x86_64
    feature.node.kubernetes.io/cpu-cpuid.IBSFFV: 'true'
    nvidia.com/cuda.runtime.minor: '2'
    nvidia.com/mig-7g.80gb.memory: '80640'
    feature.node.kubernetes.io/cpu-rdt.RDTL3CA: 'true'
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: '12'
    feature.node.kubernetes.io/cpu-cpuid.VTE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVMDA: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: 'true'
    feature.node.kubernetes.io/pci-8086.sriov.capable: 'true'
    feature.node.kubernetes.io/cpu-cpuid.NRIPS: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVES: 'true'
    beta.kubernetes.io/os: linux
    feature.node.kubernetes.io/cpu-model.family: '25'
    feature.node.kubernetes.io/kernel-version.minor: '18'
    feature.node.kubernetes.io/cpu-cpuid.WBNOINVD: 'true'
    nvidia.com/gpu.machine: PowerEdge-R7525
    nvidia.com/gpu.memory: '81920'
    feature.node.kubernetes.io/cpu-cpuid.AESNI: 'true'
    nvidia.com/mig-7g.80gb.engines.jpeg: '1'
    nvidia.com/cuda.runtime.major: '12'
    nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    kubernetes.io/os: linux
    feature.node.kubernetes.io/cpu-cpuid.SVM: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SME: 'true'
    nvidia.com/gpu.deploy.device-plugin: 'true'
    nvidia.com/mig.config: all-7g.80gb
    nvidia.com/gpu.deploy.nvsm: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP: 'true'
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: '4'
    feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT: 'true'
    nvidia.com/gpu.deploy.mig-manager: 'true'
    nvidia.com/gpu.family: ampere
    feature.node.kubernetes.io/cpu-cpuid.SHA: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVML: 'true'
    nvidia.com/mig-7g.80gb.engines.decoder: '5'
    node-role.kubernetes.io/baremetal: ''
    feature.node.kubernetes.io/cpu-cpuid.SVMNP: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW: 'true'
    nvidia.com/mig.capable: 'true'
    feature.node.kubernetes.io/kernel-version.major: '4'
    feature.node.kubernetes.io/pci-102b.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true'
    nvidia.com/mig.config.state: success
    feature.node.kubernetes.io/cpu-cpuid.SUCCOR: 'true'
    feature.node.kubernetes.io/cpu-cpuid.INVLPGB: 'true'
    feature.node.kubernetes.io/pci-14e4.present: 'true'
    feature.node.kubernetes.io/kernel-version.revision: '0'
    feature.node.kubernetes.io/storage-nonrotationaldisk: 'true'
    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: 'true'
    node-role.kubernetes.io/worker: ''
    nvidia.com/gpu.count: '1'
    feature.node.kubernetes.io/cpu-cpuid.LBRVIRT: 'true'
    nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
    nvidia.com/gfd.timestamp: '1710459611'
    nvidia.com/gpu.deploy.driver: 'true'
    nvidia.com/cuda.driver.minor: '154'
    feature.node.kubernetes.io/cpu-cpuid.CETSS: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE: 'true'
    nvidia.com/mig-7g.80gb.engines.ofa: '1'
    feature.node.kubernetes.io/cpu-cpuid.SSE4A: 'true'
    feature.node.kubernetes.io/pci-10de.sriov.capable: 'true'
    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: 'true'
    feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR: 'true'
    nvidia.com/gpu.deploy.operator-validator: 'true'
    feature.node.kubernetes.io/cpu-rdt.RDTCMT: 'true'
    feature.node.kubernetes.io/iommu-enabled: 'true'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVMFBASID: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_ES: 'true'
    nvidia.com/mig-7g.80gb.slices.gi: '7'
    feature.node.kubernetes.io/system-os_release.RHEL_VERSION: '8.6'
    nvidia.com/mig-7g.80gb.product: NVIDIA-A100-80GB-PCIe-MIG-7g.80gb
    nvidia.com/mig-7g.80gb.count: '1'
    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT: 'true'
    node.openshift.io/os_id: rhcos
    feature.node.kubernetes.io/pci-8086.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SCE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.ADX: 'true'
    feature.node.kubernetes.io/cpu-rdt.RDTMON: 'true'
    nvidia.com/cuda.driver.major: '535'
    feature.node.kubernetes.io/cpu-cpuid.CPBOOST: 'true'
    feature.node.kubernetes.io/memory-numa: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MOVBE: 'true'
    nvidia.com/mig-7g.80gb.engines.copy: '7'
    feature.node.kubernetes.io/cpu-cpuid.X87: 'true'
    nvidia.com/mig-7g.80gb.slices.ci: '7'
    feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVEC: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MSRIRC: 'true'
    feature.node.kubernetes.io/cpu-model.vendor_id: AMD
    feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK: 'true'
    feature.node.kubernetes.io/cpu-cpuid.AVX2: 'true'
    kubernetes.io/hostname: lin-crete-metal-gpu-08
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    feature.node.kubernetes.io/system-os_release.ID: rhcos
    feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT: 'true'
    feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 412.86.202402131748-0
    beta.kubernetes.io/arch: amd64
    nvidia.com/gpu.deploy.node-status-exporter: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_SNP: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBS: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FXSR: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST: 'true'
    feature.node.kubernetes.io/cpu-cpuid.CLZERO: 'true'
    kubernetes.io/arch: amd64
    nvidia.com/mig.strategy: mixed
    feature.node.kubernetes.io/cpu-cpuid.VMPL: 'true'
    feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION: '4.12'
    nvidia.com/gpu.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.LAHF: 'true'
    feature.node.kubernetes.io/pci-10de.present: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SVMPFT: 'true'
    nvidia.com/mig-7g.80gb.multiprocessors: '98'
    feature.node.kubernetes.io/cpu-model.id: '1'
    feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XGETBV1: 'true'
    feature.node.kubernetes.io/kernel-selinux.enabled: 'true'
    feature.node.kubernetes.io/system-os_release.VERSION_ID: '4.12'
    nvidia.com/mig-7g.80gb.replicas: '1'
    feature.node.kubernetes.io/network-sriov.capable: 'true'
    nvidia.com/cuda.driver.rev: '05'
    feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED: 'true'
    nvidia.com/gpu.compute.minor: '0'
    feature.node.kubernetes.io/cpu-cpuid.SVMPF: 'true'
    feature.node.kubernetes.io/cpu-rdt.RDTMBM: 'true'
    feature.node.kubernetes.io/cpu-cpuid.RDPRU: 'true'
    feature.node.kubernetes.io/cpu-hardware_multithreading: 'true'
    portworx.io/nobackup: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FXSROPT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FMA3: 'true'
    feature.node.kubernetes.io/cpu-cpuid.VAES: 'true'
    feature.node.kubernetes.io/cpu-cpuid.AVX: 'true'
    nvidia.com/gpu.replicas: '0'
    feature.node.kubernetes.io/kernel-config.NO_HZ: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SEV: 'true'
    nvidia.com/gpu.compute.major: '8'
    nvidia.com/mig-7g.80gb.engines.encoder: '0'
    feature.node.kubernetes.io/cpu-cpuid.MCOMMIT: 'true'
status:
  capacity:
    cpu: '128'
    ephemeral-storage: 233829932Ki
    memory: 792179236Ki
    nvidia.com/mig-1g.10gb: '0'
    nvidia.com/mig-7g.80gb: '1'
  allocatable:
    cpu: '124'
    ephemeral-storage: '215497664975'
    memory: 760721956Ki
    nvidia.com/mig-1g.10gb: '0'
    nvidia.com/mig-7g.80gb: '1'
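The mismatch can also be stated as a check between the node's allocatable MIG resources and the exporter label. The following is a minimal sketch using the values from this report; `profile_from_resource` is a hypothetical helper, and it assumes MIG resource names follow the `nvidia.com/mig-<profile>` pattern seen above.

```python
MIG_PREFIX = "nvidia.com/mig-"


def profile_from_resource(resource_name: str) -> str:
    """Strip the assumed 'nvidia.com/mig-' prefix to get the MIG profile name."""
    if not resource_name.startswith(MIG_PREFIX):
        raise ValueError(f"not a MIG resource: {resource_name}")
    return resource_name[len(MIG_PREFIX):]


# Allocatable MIG resources copied from the node status above.
allocatable = {"nvidia.com/mig-1g.10gb": "0", "nvidia.com/mig-7g.80gb": "1"}
# GPU_I_PROFILE value reported by DCGM Exporter.
exported_profile = "7g.79gb"

# Profiles that the node actually offers (count greater than zero).
expected = {
    profile_from_resource(name)
    for name, count in allocatable.items()
    if int(count) > 0
}
print(expected)                      # {'7g.80gb'}
print(exported_profile in expected)  # False
```

A dashboard that joins exporter metrics to node resources on the profile name would fail in the same way as the second check.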