Error when trying to use operator on DGX A100-80GB with microk8s and mixed strategy MIG

Open reuben opened this issue 2 years ago • 19 comments

1. Issue or feature description

On a DGX A100-80GB, I'm trying to install the operator with the mixed MIG strategy. Feature discovery and node labeling work fine with MIG disabled, but as soon as I set a MIG config label on the node and mig-manager reconfigures the GPUs, discovery and labeling of GPUs stop working.

2. Steps to reproduce the issue

System info:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal
$ dpkg -l | grep nvidia
ii  libnvidia-cfg1-470-server:amd64      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-470-server          470.103.01-0ubuntu0.20.04.1             all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-470-server:amd64   470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA libcompute package
ii  libnvidia-container-tools            1.7.0-1                                 amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.7.0-1                                 amd64        NVIDIA container runtime library
ii  libnvidia-decode-470-server:amd64    470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-470-server:amd64    470.103.01-0ubuntu0.20.04.1             amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-470-server:amd64     470.103.01-0ubuntu0.20.04.1             amd64        Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-470-server:amd64      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470-server:amd64        470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470-server:amd64      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  nvidia-acs-disable                   19.12.0                                 amd64        Disables the PCIe ACS capability
ii  nvidia-compute-utils-470-server      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA compute utilities
ii  nvidia-conf-cachefilesd              20.06-1                                 amd64        Systemd settings for cachefilesd
ii  nvidia-container-runtime             3.7.0-1                                 all          NVIDIA container runtime
ii  nvidia-container-toolkit             1.7.0-1                                 amd64        NVIDIA container runtime hook
ii  nvidia-crashdump                     20.12-1                                 amd64        NVIDA crash dump policy
ii  nvidia-dcgm-enable                   21.07-1                                 all          Enable DCGM
ii  nvidia-disable-iscsid                20.06-1                                 all          Disable iscsid on NVIDIA platforms that don't support it
ii  nvidia-dkms-470-server               470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA DKMS package
ii  nvidia-driver-470-server             470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA Server Driver metapackage
ii  nvidia-enable-journaling             20.06-1                                 all          Package that enables journal_data on root file system
ii  nvidia-fabricmanager-470             470.103.01-0ubuntu0.20.04.1             amd64        Fabric Manager for NVSwitch based systems.
ii  nvidia-icmp                          20.06-1                                 amd64        DGX iptable settings
ii  nvidia-ipmisol                       21.01-1                                 amd64        Enable IPMI Serial-over-LAN
ii  nvidia-kernel-common-470-server      470.103.01-0ubuntu0.20.04.1             amd64        Shared files used with the kernel module
ii  nvidia-kernel-defaults               21.05-1                                 all          sysctl default kernel settings for DGX.
ii  nvidia-kernel-source-470-server      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA kernel source package
ii  nvidia-lldpd-defaults                21.05-1                                 all          lldpd defaults for Nvidia servers
ii  nvidia-logrotate                     21.11-1                                 all          NVIDIA logrotate policy
ii  nvidia-mig-manager                   0.1.2-1                                 amd64        NVIDIA MIG Partition Editor and Systemd Service
ii  nvidia-mlnx-config                   20.10.1                                 amd64        Configure the MLNX devices
ii  nvidia-motd                          21.03-1                                 all          Custom motd files for NVIDIA platforms
ii  nvidia-nvme-core-options             20.06-1                                 amd64        Modify nvme core options
ii  nvidia-nvme-smartd                   20.06-1                                 all          Enable SMART monitoring on NVME devices
ii  nvidia-oem-config-bmc                21.01-2                                 all          Ubiquity plugin to configure BMC on NVIDIA platforms
ii  nvidia-oem-config-crypt-passwd       21.01-2                                 all          Ubiquity plugin to reset crypt password
ii  nvidia-oem-config-eula               21.01-2                                 all          Ubiquity plugin to display EULA
ii  nvidia-oem-config-grub-passwd        21.01-2                                 all          Ubiquity plugin to configure GRUB password on NVIDIA platforms
ii  nvidia-oem-config-postact            21.01-2                                 all          Ubiquity plugin to complete final actions before booting
ii  nvidia-pci-bridge-power              21.11-1                                 amd64        Sets PCI bridge power control to on
ii  nvidia-peer-memory                   1.2-0-nvidia1                           all          nvidia peer memory kernel module.
ii  nvidia-peer-memory-dkms              1.2-0-nvidia1                           all          DKMS support for nvidia-peer-memory kernel modules
ii  nvidia-raid-config                   21.07-1                                 amd64        DGX RAID Configuration
ii  nvidia-redfish-config                20.10-1                                 all          Configure Redfish Host Interface
ii  nvidia-relaxed-ordering-gpu          20.10-1                                 amd64        Configure PCIe Relaxed Ordering
ii  nvidia-relaxed-ordering-nvme         20.10-1                                 amd64        Configure PCIe Relaxed Ordering
ii  nvidia-repo-keys                     20.06-1                                 amd64        Adds keys to apt trusted.gpg database
ii  nvidia-system-tools                  20.11-1                                 amd64        Metapackage for NVIDIA system tools stack
ii  nvidia-utils-470-server              470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA Server Driver support binaries
ii  xserver-xorg-video-nvidia-470-server 470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA binary Xorg driver

Steps:

$ sudo snap install microk8s --classic
$ sudo microk8s.enable dns helm3
$ sudo microk8s.helm3 repo add nvidia https://nvidia.github.io/gpu-operator
$ sudo microk8s.helm3 repo update
$ cat gpu-operator-helm-chart-options.yaml
operator:
  defaultRuntime: containerd

driver:
  enabled: false

mig:
  strategy: mixed

toolkit:
  enabled: false
  env:
  - name: CONTAINERD_CONFIG
    value: /var/snap/microk8s/current/args/containerd-template.toml
  - name: CONTAINERD_SOCKET
    value: /var/snap/microk8s/common/run/containerd.sock

$ sudo microk8s.helm3 install --wait gpu-operator -n gpu-operator --create-namespace -f gpu-operator-helm-chart-options.yaml nvidia/gpu-operator
NAME: gpu-operator
LAST DEPLOYED: Mon Feb 14 19:07:44 2022
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Then:

$ kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-operator-node-feature-discovery-master-5f6fb954cf-pzl54   1/1     Running                 0             3m15s
gpu-operator-node-feature-discovery-worker-vxx4w              1/1     Running                 0             3m15s
gpu-operator-7ff85f9c4f-6cggd                                 1/1     Running                 0             3m15s
nvidia-mig-manager-z87d5                                      1/1     Running                 0             2m56s
nvidia-operator-validator-p78bm                               0/1     Init:2/4                0             2m56s
nvidia-dcgm-exporter-fw6mm                                    0/1     CrashLoopBackOff        4 (86s ago)   2m56s
gpu-feature-discovery-rchnt                                   0/1     CrashLoopBackOff        4 (85s ago)   2m56s
nvidia-device-plugin-daemonset-vlmjb                          0/1     CrashLoopBackOff        4 (71s ago)   2m56s
nvidia-cuda-validator-ws464                                   0/1     Init:CrashLoopBackOff   4 (69s ago)   2m46s
$ kubectl logs -n gpu-operator pods/nvidia-device-plugin-daemonset-vlmjb
2022/02/14 19:11:12 Loading NVML
2022/02/14 19:11:12 Starting FS watcher.
2022/02/14 19:11:12 Starting OS watcher.
2022/02/14 19:11:12 Retreiving plugins.
2022/02/14 19:11:12 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-78dba802-c88e-5c2f-f8f1-1d6715d3b565

goroutine 1 [running]:
main.(*migStrategyMixed).GetPlugins(0xe25da8, 0x5, 0xac21c0, 0xe25da8)
	/build/cmd/nvidia-device-plugin/mig-strategy.go:171 +0x865
main.start(0xc0002e3040, 0x0, 0x0)
	/build/cmd/nvidia-device-plugin/main.go:149 +0x5bc
github.com/urfave/cli/v2.(*App).RunContext(0xc000466000, 0xac8e80, 0xc000028038, 0xc0000201d0, 0x1, 0x1, 0x0, 0x0)
	/build/vendor/github.com/urfave/cli/v2/app.go:315 +0x70d
github.com/urfave/cli/v2.(*App).Run(...)
	/build/vendor/github.com/urfave/cli/v2/app.go:215
main.main()
	/build/cmd/nvidia-device-plugin/main.go:91 +0x5c5
$ kubectl logs -n gpu-operator pod/gpu-feature-discovery-rchnt
gpu-feature-discovery: 2022/02/14 19:13:48 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2022/02/14 19:13:48 Loaded configuration:
gpu-feature-discovery: 2022/02/14 19:13:48 Oneshot: false
gpu-feature-discovery: 2022/02/14 19:13:48 FailOnInitError: true
gpu-feature-discovery: 2022/02/14 19:13:48 SleepInterval: 1m0s
gpu-feature-discovery: 2022/02/14 19:13:48 MigStrategy: mixed
gpu-feature-discovery: 2022/02/14 19:13:48 NoTimestamp: false
gpu-feature-discovery: 2022/02/14 19:13:48 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2022/02/14 19:13:48 Start running
gpu-feature-discovery: 2022/02/14 19:13:48 Warning: Error removing output file: Failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/02/14 19:13:48 Unexpected error: Error generating NVML labels: Error generating common labels: Error getting device: nvml: Insufficient Permissions
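
For triage, the pod listing above can be reduced to the pods whose READY count is below the total. This is only a rough sketch: `not_ready` and `LISTING` are hypothetical names, and it assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...), with the listing pasted in as text.

```python
# Listing copied from the `kubectl get pods -n gpu-operator` output above.
LISTING = """\
NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-operator-node-feature-discovery-master-5f6fb954cf-pzl54   1/1     Running                 0             3m15s
gpu-operator-node-feature-discovery-worker-vxx4w              1/1     Running                 0             3m15s
gpu-operator-7ff85f9c4f-6cggd                                 1/1     Running                 0             3m15s
nvidia-mig-manager-z87d5                                      1/1     Running                 0             2m56s
nvidia-operator-validator-p78bm                               0/1     Init:2/4                0             2m56s
nvidia-dcgm-exporter-fw6mm                                    0/1     CrashLoopBackOff        4 (86s ago)   2m56s
gpu-feature-discovery-rchnt                                   0/1     CrashLoopBackOff        4 (85s ago)   2m56s
nvidia-device-plugin-daemonset-vlmjb                          0/1     CrashLoopBackOff        4 (71s ago)   2m56s
nvidia-cuda-validator-ws464                                   0/1     Init:CrashLoopBackOff   4 (69s ago)   2m46s
"""


def not_ready(table_text):
    """Return (name, ready, status) for pods whose READY count is below the total."""
    rows = []
    for line in table_text.strip().splitlines()[1:]:  # skip the header row
        # Split only the first three columns; RESTARTS may contain spaces.
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue
        name, ready, status = parts[:3]
        have, want = ready.split("/")
        if int(have) < int(want):
            rows.append((name, ready, status))
    return rows


if __name__ == "__main__":
    for name, ready, status in not_ready(LISTING):
        print(f"{name}: {ready} {status}")
```

On this listing it reports the validator, dcgm-exporter, gpu-feature-discovery, device-plugin and cuda-validator pods, which are the components whose logs are worth collecting.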

But if I then do:

$ kubectl label node/dgxa100 nvidia.com/mig.config=all-disabled --overwrite
$ kubectl logs -f -n gpu-operator pod/nvidia-mig-manager-z87d5
time="2022-02-14T19:15:00Z" level=info msg="Updating to MIG config: all-disabled"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=failed'
Checking if the selected MIG config is currently applied or not
time="2022-02-14T19:15:00Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Persisting all-disabled to /etc/systemd/system/nvidia-mig-manager.service.d/override.conf
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
time="2022-02-14T19:15:01Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/dgxa100 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/dgxa100 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-vlmjb condition met
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Shutting down all GPU clients on the host by stopping their systemd services
Stopping nvsm.service (active, will-restart)
Skipping nvsm-mqtt.service (inactive, will-restart)
Skipping nvsm-core.service (inactive, will-restart)
Skipping nvsm-api-gateway.service (inactive, will-restart)
Skipping nvsm-notifier.service (inactive, will-restart)
Stopping nv_peer_mem.service (active, will-restart)
Stopping nvidia-dcgm.service (active, will-restart)
Skipping dcgm.service (disabled)
Skipping dcgm-exporter.service (no-exist)
Skipping kubelet.service (no-exist)
Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-02-14T19:15:44Z" level=debug msg="Parsing config file..."
time="2022-02-14T19:15:44Z" level=debug msg="Selecting specific MIG config..."
time="2022-02-14T19:15:44Z" level=debug msg="Running apply-start hook"
time="2022-02-14T19:15:44Z" level=debug msg="Checking current MIG mode..."
time="2022-02-14T19:15:44Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="Running pre-apply-mode hook"
time="2022-02-14T19:15:44Z" level=debug msg="Applying MIG mode change..."
time="2022-02-14T19:15:44Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="At least one mode change pending"
time="2022-02-14T19:15:45Z" level=debug msg="Resetting GPUs..."
time="2022-02-14T19:15:45Z" level=debug msg="  NVIDIA kernel module loaded"
time="2022-02-14T19:15:45Z" level=debug msg="  Using nvidia-smi to perform GPU reset"
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-02-14T19:16:05Z" level=debug msg="Parsing config file..."
time="2022-02-14T19:16:05Z" level=debug msg="Selecting specific MIG config..."
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-start hook"
time="2022-02-14T19:16:05Z" level=debug msg="Checking current MIG mode..."
time="2022-02-14T19:16:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="Checking current MIG device configuration..."
time="2022-02-14T19:16:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Restarting all GPU clients previously shutdown on the host by restarting their systemd services
Starting nvidia-dcgm.service
Starting nv_peer_mem.service
Starting nvsm-notifier.service
Starting nvsm-api-gateway.service
Starting nvsm-core.service
Starting nvsm-mqtt.service
Starting nvsm.service
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/dgxa100 labeled
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-p78bm" deleted
Changing the 'nvidia.com/mig.config.state' node label to 'success'
node/dgxa100 labeled
time="2022-02-14T19:17:43Z" level=info msg="Successfuly updated to MIG config: all-disabled"
time="2022-02-14T19:17:43Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
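
A log of this length is easier to compare across runs once it is reduced to the last MIG state reported per GPU plus any `level=fatal` messages. The sketch below does that for a pasted log. `summarize` and `SAMPLE` are hypothetical names, and the regular expressions assume the `time=... level=... msg="..."` line format seen above; lines that do not match (the plain script output) are skipped.

```python
import re

# Matches the structured lines: level=<level> msg="<message>"
MSG_RE = re.compile(r'level=(\w+) msg="(.*)"\s*$')
GPU_RE = re.compile(r"^\s*GPU (\d+):")
FIELD_RE = re.compile(
    r"^\s*(Current MIG mode|Updating MIG mode|Mode change pending): (\S+)"
)

# A few lines copied from the log above, for illustration.
SAMPLE = r'''
time="2022-02-14T19:15:00Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
'''


def summarize(log_text):
    """Return (per_gpu, fatals): last-seen MIG fields per GPU and all fatal messages."""
    per_gpu = {}
    fatals = []
    gpu = None  # index of the GPU the current indented fields belong to
    for line in log_text.splitlines():
        m = MSG_RE.search(line)
        if m is None:
            continue  # plain script output, not a structured log line
        level, msg = m.groups()
        if level == "fatal":
            fatals.append(msg)
            continue
        g = GPU_RE.match(msg)
        if g:
            gpu = int(g.group(1))
            per_gpu.setdefault(gpu, {})
            continue
        f = FIELD_RE.match(msg)
        if f and gpu is not None:
            per_gpu[gpu][f.group(1)] = f.group(2)  # later lines overwrite earlier ones
    return per_gpu, fatals


if __name__ == "__main__":
    gpus, fatals = summarize(SAMPLE)
    for index, fields in sorted(gpus.items()):
        print(index, fields)
    print("fatal:", fatals)
```

In the run above, both `level=fatal` lines are followed by a successful apply, so they appear to be the result of the "is this config already applied?" checks rather than a hard failure.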

And then GPUs pop up normally on the node and I can allocate and use them:

$ kubectl describe nodes
[...]
Capacity:
  cpu:                256
  ephemeral-storage:  1843217020Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2113603344Ki
  nvidia.com/gpu:     8
  pods:               110
[...]

reuben avatar Feb 14 '22 19:02 reuben

@reuben Which mig-config are you trying to set with mig-manager? Can you share the mig-manager log when you enable it? Also, which version of operator/toolkit is this?

shivamerla avatar Feb 17 '22 12:02 shivamerla

@shivamerla I'm on the latest version of the operator, v1.9.1. The mig-parted config is here; I'm trying to use the coqui-dgxa100-0-layout config. I can't easily get the logs right now, but there are no errors: it enables MIG mode just fine, and I can verify that by running nvidia-smi on the host. It's the rest of the operator components that stop working after that.

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false

      all-7g.80gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "7g.80gb": 1

      coqui-dgxa100-0-layout:
        - devices: [0, 1, 2, 3]
          mig-enabled: true
          mig-devices:
            "7g.80gb": 1
        - devices: [4]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [5]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [6]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "2g.20gb": 2
        - devices: [7]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "2g.20gb": 1
            "1g.10gb": 2
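
As a quick sanity check on the layout above, the sketch below sums the compute and memory slices requested per GPU. It rests on assumptions not stated in this thread: that profile names follow the `<N>g.<M>gb` pattern, and that an A100-80GB offers 7 compute slices and 80 GB per GPU. `LAYOUT`, `usage` and `check_layout` are hypothetical names, and `LAYOUT` is a hand transcription of `coqui-dgxa100-0-layout`. It only checks totals; it does not model the placement rules that the driver and mig-parted enforce.

```python
import re

# Assumed per-GPU budget for an A100-80GB (not taken from this thread).
COMPUTE_SLICES = 7
MEMORY_GB = 80

# Hand transcription of coqui-dgxa100-0-layout from the ConfigMap above.
LAYOUT = [
    {"devices": [0, 1, 2, 3], "mig-devices": {"7g.80gb": 1}},
    {"devices": [4], "mig-devices": {"3g.40gb": 2}},
    {"devices": [5], "mig-devices": {"2g.20gb": 3}},
    {"devices": [6], "mig-devices": {"3g.40gb": 1, "2g.20gb": 2}},
    {"devices": [7], "mig-devices": {"3g.40gb": 1, "2g.20gb": 1, "1g.10gb": 2}},
]

PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def usage(mig_devices):
    """Return (compute_slices, memory_gb) requested by one mig-devices mapping."""
    compute = memory = 0
    for profile, count in mig_devices.items():
        m = PROFILE_RE.match(profile)
        if m is None:
            raise ValueError(f"unrecognized profile name: {profile!r}")
        compute += int(m.group(1)) * count
        memory += int(m.group(2)) * count
    return compute, memory


def check_layout(layout):
    """Map each GPU index to (compute, memory, fits_budget)."""
    report = {}
    for entry in layout:
        compute, memory = usage(entry["mig-devices"])
        fits = compute <= COMPUTE_SLICES and memory <= MEMORY_GB
        for gpu in entry["devices"]:
            report[gpu] = (compute, memory, fits)
    return report


if __name__ == "__main__":
    for gpu, (compute, memory, fits) in sorted(check_layout(LAYOUT).items()):
        print(
            f"GPU {gpu}: {compute}/{COMPUTE_SLICES} compute slices, "
            f"{memory}/{MEMORY_GB} GB, fits={fits}"
        )
```

Under these assumptions every GPU in the layout stays within budget, which is consistent with mig-manager accepting the config as valid.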

reuben avatar Feb 17 '22 12:02 reuben

@shivamerla actually I just realized the mig manager log is already included in https://github.com/NVIDIA/gpu-operator/issues/322#issue-1137737440

reuben avatar Feb 25 '22 17:02 reuben

Sorry, I guess you want the one enabling the mode, not disabling it.

reuben avatar Feb 25 '22 17:02 reuben

@reuben yes, please attach the log from enabling MIG with the specific config. Also, ensure that MIG mode is not enabled prior to installing the GPU Operator. cc @klueska

shivamerla avatar Mar 08 '22 12:03 shivamerla

@shivamerla here's the log when enabling MIG with $ kubectl label node/dgxa100 nvidia.com/mig.config=coqui-dgxa100-0-layout:

time="2022-03-13T12:03:03Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
time="2022-03-13T12:03:45Z" level=info msg="Updating to MIG config: coqui-dgxa100-0-layout"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=success'
Checking if the selected MIG config is currently applied or not
time="2022-03-13T12:03:46Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Persisting coqui-dgxa100-0-layout to /etc/systemd/system/nvidia-mig-manager.service.d/override.conf
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
time="2022-03-13T12:03:47Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/dgxa100 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/dgxa100 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-dsjwr condition met
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Shutting down all GPU clients on the host by stopping their systemd services
Stopping nvsm.service (active, will-restart)
Skipping nvsm-mqtt.service (inactive, will-restart)
Skipping nvsm-core.service (inactive, will-restart)
Skipping nvsm-api-gateway.service (inactive, will-restart)
Skipping nvsm-notifier.service (inactive, will-restart)
Stopping nv_peer_mem.service (active, will-restart)
Stopping nvidia-dcgm.service (active, will-restart)
Skipping dcgm.service (disabled)
Skipping dcgm-exporter.service (no-exist)
Skipping kubelet.service (no-exist)
Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-03-13T12:04:35Z" level=debug msg="Parsing config file..."
time="2022-03-13T12:04:35Z" level=debug msg="Selecting specific MIG config..."
time="2022-03-13T12:04:35Z" level=debug msg="Running apply-start hook"
time="2022-03-13T12:04:35Z" level=debug msg="Checking current MIG mode..."
time="2022-03-13T12:04:35Z" level=debug msg="Walking MigConfig for (devices=[0 1 2 3])"
time="2022-03-13T12:04:35Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-13T12:04:35Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:35Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:35Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:35Z" level=debug msg="Running pre-apply-mode hook"
time="2022-03-13T12:04:35Z" level=debug msg="Applying MIG mode change..."
time="2022-03-13T12:04:35Z" level=debug msg="Walking MigConfig for (devices=[0 1 2 3])"
time="2022-03-13T12:04:35Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-13T12:04:35Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:35Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:35Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:38Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:38Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-03-13T12:04:38Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:38Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:38Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:41Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:41Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-03-13T12:04:41Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:41Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:41Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:43Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:43Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-03-13T12:04:43Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:43Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:43Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:46Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:46Z" level=debug msg="Walking MigConfig for (devices=[4])"
time="2022-03-13T12:04:46Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-03-13T12:04:46Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:46Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:46Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:49Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:49Z" level=debug msg="Walking MigConfig for (devices=[5])"
time="2022-03-13T12:04:49Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-03-13T12:04:49Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:49Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:49Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:52Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:52Z" level=debug msg="Walking MigConfig for (devices=[6])"
time="2022-03-13T12:04:52Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-03-13T12:04:52Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:52Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:52Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:54Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:54Z" level=debug msg="Walking MigConfig for (devices=[7])"
time="2022-03-13T12:04:54Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-03-13T12:04:54Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:55Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-03-13T12:04:55Z" level=debug msg="    Updating MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    Mode change pending: false"
time="2022-03-13T12:04:57Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-03-13T12:04:57Z" level=debug msg="Parsing config file..."
time="2022-03-13T12:04:57Z" level=debug msg="Selecting specific MIG config..."
time="2022-03-13T12:04:57Z" level=debug msg="Running apply-start hook"
time="2022-03-13T12:04:57Z" level=debug msg="Checking current MIG mode..."
time="2022-03-13T12:04:57Z" level=debug msg="Walking MigConfig for (devices=[0 1 2 3])"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="Walking MigConfig for (devices=[4])"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="Walking MigConfig for (devices=[5])"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="Walking MigConfig for (devices=[6])"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="Walking MigConfig for (devices=[7])"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:57Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-13T12:04:57Z" level=debug msg="Checking current MIG device configuration..."
time="2022-03-13T12:04:57Z" level=debug msg="Walking MigConfig for (devices=[0 1 2 3])"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-03-13T12:04:57Z" level=debug msg="    Asserting MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:57Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    Asserting MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    Asserting MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="Walking MigConfig for (devices=[4])"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    Asserting MIG config: map[3g.40gb:2]"
time="2022-03-13T12:04:58Z" level=debug msg="Walking MigConfig for (devices=[5])"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    Asserting MIG config: map[2g.20gb:3]"
time="2022-03-13T12:04:58Z" level=debug msg="Walking MigConfig for (devices=[6])"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    Asserting MIG config: map[2g.20gb:2 3g.40gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="Walking MigConfig for (devices=[7])"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    Asserting MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="Running pre-apply-config hook"
time="2022-03-13T12:04:58Z" level=debug msg="Applying MIG device configuration..."
time="2022-03-13T12:04:58Z" level=debug msg="Walking MigConfig for (devices=[0 1 2 3])"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:58Z" level=debug msg="    Updating MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:58Z" level=debug msg="    Updating MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:58Z" level=debug msg="    Updating MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:58Z" level=debug msg="    Updating MIG config: map[7g.80gb:1]"
time="2022-03-13T12:04:58Z" level=debug msg="Walking MigConfig for (devices=[4])"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:58Z" level=debug msg="    Updating MIG config: map[3g.40gb:2]"
time="2022-03-13T12:04:58Z" level=debug msg="Walking MigConfig for (devices=[5])"
time="2022-03-13T12:04:58Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-03-13T12:04:58Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:59Z" level=debug msg="    Updating MIG config: map[2g.20gb:3]"
time="2022-03-13T12:04:59Z" level=debug msg="Walking MigConfig for (devices=[6])"
time="2022-03-13T12:04:59Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-03-13T12:04:59Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:59Z" level=debug msg="    Updating MIG config: map[2g.20gb:2 3g.40gb:1]"
time="2022-03-13T12:04:59Z" level=debug msg="Walking MigConfig for (devices=[7])"
time="2022-03-13T12:04:59Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-03-13T12:04:59Z" level=debug msg="    MIG capable: true\n"
time="2022-03-13T12:04:59Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
time="2022-03-13T12:04:59Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Restarting all GPU clients previously shutdown on the host by restarting their systemd services
Starting nvidia-dcgm.service
Starting nv_peer_mem.service
Starting nvsm-notifier.service
Starting nvsm-api-gateway.service
Starting nvsm-core.service
Starting nvsm-mqtt.service
Starting nvsm.service
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/dgxa100 labeled
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-hlmlr" deleted
Changing the 'nvidia.com/mig.config.state' node label to 'success'
node/dgxa100 labeled
time="2022-03-13T12:06:47Z" level=info msg="Successfuly updated to MIG config: coqui-dgxa100-0-layout"
time="2022-03-13T12:06:47Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"

reuben avatar Mar 13 '22 12:03 reuben
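As a quick sanity check on the layout in the log above, the sketch below sums the compute slices and memory requested per GPU. It assumes an A100-80GB offers 7 compute slices and 80 GB of memory per GPU, and that profile names follow the `<N>g.<M>gb` pattern seen in the log. The name `check_layout` is hypothetical. Passing this check is necessary but not sufficient, since real MIG placement rules are stricter than a simple sum.

```python
import re

# Assumed per-GPU limits for an A100-80GB (not taken from the log itself).
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_GB = 80

# Matches profile names such as "3g.40gb".
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def check_layout(layout):
    """Return (compute_slices, memory_gb, fits) for a {profile: count} mapping."""
    compute = 0
    memory = 0
    for profile, count in layout.items():
        match = PROFILE_RE.match(profile)
        if match is None:
            raise ValueError(f"unrecognized MIG profile: {profile!r}")
        compute += int(match.group(1)) * count
        memory += int(match.group(2)) * count
    fits = compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_GB
    return compute, memory, fits


# Layouts copied from the "Updating MIG config" lines in the log.
layouts = {
    "GPU 0-3": {"7g.80gb": 1},
    "GPU 4": {"3g.40gb": 2},
    "GPU 5": {"2g.20gb": 3},
    "GPU 6": {"2g.20gb": 2, "3g.40gb": 1},
    "GPU 7": {"1g.10gb": 2, "2g.20gb": 1, "3g.40gb": 1},
}

for name, layout in layouts.items():
    print(name, check_layout(layout))
```

Under these assumptions every layout in the log fits, which is consistent with mig-manager reporting success; the failures that follow come from the consumers of the MIG devices, not from the layout.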

And after that, as before:

$ kubectl logs -n gpu-operator pod/nvidia-device-plugin-daemonset-6c27l
2022/03/13 12:09:47 Loading NVML
2022/03/13 12:09:47 Starting FS watcher.
2022/03/13 12:09:47 Starting OS watcher.
2022/03/13 12:09:47 Retreiving plugins.
2022/03/13 12:09:47 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-78dba802-c88e-5c2f-f8f1-1d6715d3b565

goroutine 1 [running]:
main.(*migStrategyMixed).GetPlugins(0xe25da8, 0x5, 0xac21c0, 0xe25da8)
	/build/cmd/nvidia-device-plugin/mig-strategy.go:171 +0x865
main.start(0xc0002e3040, 0x0, 0x0)
	/build/cmd/nvidia-device-plugin/main.go:149 +0x5bc
github.com/urfave/cli/v2.(*App).RunContext(0xc000466000, 0xac8e80, 0xc000028038, 0xc0000201d0, 0x1, 0x1, 0x0, 0x0)
	/build/vendor/github.com/urfave/cli/v2/app.go:315 +0x70d
github.com/urfave/cli/v2.(*App).Run(...)
	/build/vendor/github.com/urfave/cli/v2/app.go:215
main.main()
	/build/cmd/nvidia-device-plugin/main.go:91 +0x5c5

reuben avatar Mar 13 '22 12:03 reuben

$ kubectl logs -n gpu-operator pod/gpu-feature-discovery-c75xt
gpu-feature-discovery: 2022/03/13 12:09:51 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2022/03/13 12:09:51 Loaded configuration:
gpu-feature-discovery: 2022/03/13 12:09:51 Oneshot: false
gpu-feature-discovery: 2022/03/13 12:09:51 FailOnInitError: true
gpu-feature-discovery: 2022/03/13 12:09:51 SleepInterval: 1m0s
gpu-feature-discovery: 2022/03/13 12:09:51 MigStrategy: mixed
gpu-feature-discovery: 2022/03/13 12:09:51 NoTimestamp: false
gpu-feature-discovery: 2022/03/13 12:09:51 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2022/03/13 12:09:51 Start running
gpu-feature-discovery: 2022/03/13 12:09:51 Warning: Error removing output file: Failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/03/13 12:09:51 Unexpected error: Error generating NVML labels: Error generating common labels: Error getting device: nvml: Insufficient Permissions
gpu-feature-discovery: 2022/03/13 12:09:51 Exiting

reuben avatar Mar 13 '22 12:03 reuben

@reuben Can you please retest this with v1.10.0 of operator and verify?

shivamerla avatar Mar 23 '22 17:03 shivamerla

I get the same errors on v1.10.0

reuben avatar Mar 27 '22 20:03 reuben

In fact, I'm now trying the single strategy as a workaround to improve utilization of the GPUs, and I can't even get that to work: I see the same errors. It seems that just enabling MIG is enough to throw everything off.

reuben avatar Mar 27 '22 20:03 reuben

I think the gpu-feature-discovery error is the critical one here, which is stopping me from being able to schedule GPU pods:

gpu-feature-discovery: 2022/03/13 12:09:51 Unexpected error: Error generating NVML labels: Error generating common labels: Error getting device: nvml: Insufficient Permissions

Is there some way I can manually set the available devices so I can work around it?

reuben avatar Mar 27 '22 22:03 reuben

Because of the similarity of the error message I tried the workaround described in https://github.com/NVIDIA/nvidia-docker/issues/1547, but it didn't work.

reuben avatar Mar 27 '22 22:03 reuben

I would only expect an error of Insufficient Permissions for gpu feature discovery if NVIDIA_MIG_MONITOR_DEVICES=all was not set as an environment variable when it was launched. As far as I know this should be set by the operator though.

klueska avatar Mar 28 '22 13:03 klueska
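One way to check this is to inspect the pod spec that actually reached the cluster, since something between the operator and the node may have altered it. The sketch below takes a pod object shaped like the output of `kubectl get pod <name> -o json` and reports the value of `NVIDIA_MIG_MONITOR_DEVICES` for each container. The helper name `mig_monitor_env` and the sample pod are hypothetical, and only variables set directly through `value` are considered.

```python
def mig_monitor_env(pod):
    """Map container name -> value of NVIDIA_MIG_MONITOR_DEVICES (None if unset)."""
    result = {}
    for container in pod.get("spec", {}).get("containers", []):
        value = None
        # "env" may be missing or null when the container sets no variables.
        for env in container.get("env") or []:
            if env.get("name") == "NVIDIA_MIG_MONITOR_DEVICES":
                value = env.get("value")
        result[container["name"]] = value
    return result


# Hypothetical pod object; in practice this would come from json.load()
# on the output of `kubectl get pod <name> -o json`.
sample_pod = {
    "spec": {
        "containers": [
            {
                "name": "gpu-feature-discovery",
                "env": [{"name": "NVIDIA_MIG_MONITOR_DEVICES", "value": "all"}],
            },
            {"name": "sidecar"},
        ]
    }
}

print(mig_monitor_env(sample_pod))
```

A container that reports `None` or anything other than `all` here would be a candidate for the Insufficient Permissions error described above.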

A-ha! Thanks for the tip. I'm running [gpu-admission-webhook](https://gitlab.com/ktarplee/gpu-admission-webhook) to make sure pods only see the GPUs they get allocated, and I forgot to disable it in the operator namespace. Fixing that made the gpu-feature-discovery error go away, but now the MIG manager is the one failing :(

Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/dgxa100 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-nrgmz condition met
Waiting for gpu-feature-discovery to shutdown
pod/gpu-feature-discovery-6f6bx condition met
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
pod/nvidia-dcgm-rxgg6 condition met
Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-03-28T18:08:09Z" level=debug msg="Parsing config file..."
time="2022-03-28T18:08:09Z" level=debug msg="Selecting specific MIG config..."
time="2022-03-28T18:08:09Z" level=debug msg="Running apply-start hook"
time="2022-03-28T18:08:09Z" level=debug msg="Checking current MIG mode..."
time="2022-03-28T18:08:09Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-03-28T18:08:09Z" level=debug msg="Parsing config file..."
time="2022-03-28T18:08:09Z" level=debug msg="Selecting specific MIG config..."
time="2022-03-28T18:08:09Z" level=debug msg="Running apply-start hook"
time="2022-03-28T18:08:09Z" level=debug msg="Checking current MIG mode..."
time="2022-03-28T18:08:09Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-03-28T18:08:09Z" level=debug msg="Checking current MIG device configuration..."
time="2022-03-28T18:08:09Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="Running pre-apply-config hook"
time="2022-03-28T18:08:09Z" level=debug msg="Applying MIG device configuration..."
time="2022-03-28T18:08:09Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-03-28T18:08:09Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-03-28T18:08:09Z" level=debug msg="    MIG capable: true\n"
time="2022-03-28T18:08:09Z" level=debug msg="Running apply-exit hook"
time="2022-03-28T18:08:09Z" level=fatal msg="Error getting MIGConfig: error getting Compute instances for profile '(0, 0)': Insufficient Permissions"
Restarting all GPU clients previouly shutdown by reenabling their component-specific nodeSelector labels
node/dgxa100 labeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/dgxa100 labeled
time="2022-03-28T18:08:09Z" level=error msg="Error: exit status 1"
time="2022-03-28T18:08:09Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"

reuben avatar Mar 28 '22 18:03 reuben
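With this much debug output, the one line that matters is easy to miss. The sketch below filters logrus-style lines of the form `time="..." level=... msg="..."` down to the `error` and `fatal` entries. The function name `find_failures` is hypothetical, and the regular expression is an assumption based only on the lines shown in this thread.

```python
import re

# Assumed line shape, taken from the mig-manager output above.
LOG_RE = re.compile(
    r'^time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"$'
)


def find_failures(log_text):
    """Return (time, level, msg) tuples for error and fatal log lines."""
    failures = []
    for line in log_text.splitlines():
        match = LOG_RE.match(line.strip())
        # Lines without the logrus prefix (plain script output) are skipped.
        if match and match.group("level") in ("error", "fatal"):
            failures.append(
                (match.group("time"), match.group("level"), match.group("msg"))
            )
    return failures


# Three lines copied from the log above.
sample_log = """\
time="2022-03-28T18:08:09Z" level=debug msg="Running apply-exit hook"
time="2022-03-28T18:08:09Z" level=fatal msg="Error getting MIGConfig: error getting Compute instances for profile '(0, 0)': Insufficient Permissions"
time="2022-03-28T18:08:09Z" level=error msg="Error: exit status 1"
"""

for entry in find_failures(sample_log):
    print(entry)
```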

Never mind, I had to kill both the GPU feature discovery pod and the MIG manager pod after disabling the webhook on the namespace for the change to take effect! Now I think everything is working fine :) I'll double check and close the issue.

reuben avatar Mar 28 '22 18:03 reuben

That would suggest that the mig-manager is not running with the correct privileges. However, we clearly run it as a privileged pod by default: https://github.com/NVIDIA/gpu-operator/blob/master/assets/state-mig-manager/0600_daemonset.yaml#L54 Are you changing this setting somehow?

klueska avatar Mar 28 '22 18:03 klueska

What is this webhook exactly, and why was it interfering in this way? It would be good to know in case this issue comes up again in the future.

klueska avatar Mar 28 '22 18:03 klueska

This might be the more supported way of achieving what you are trying to do with that webhook: https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit (though I'm not sure how easy it is to plumb this through with the operator)

klueska avatar Mar 28 '22 18:03 klueska

NVIDIA_MIG_MONITOR_DEVICES=all

That works, thanks! It helps me a lot.

twskipper avatar Sep 15 '22 15:09 twskipper

> NVIDIA_MIG_MONITOR_DEVICES=all
>
> That works, thanks! It helps me a lot.

Doesn't help. I either get Insufficient Permissions or:

cannot set MIG_MONITOR_DEVICES in non privileged container: unknown

if I try to set NVIDIA_MIG_MONITOR_DEVICES=all

emsi avatar Jan 08 '24 19:01 emsi
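Going only by the error text quoted above, the variable appears to be rejected when the container is not privileged. The sketch below flags containers in a pod object (shaped like `kubectl get pod <name> -o json` output) that set `NVIDIA_MIG_MONITOR_DEVICES` without `securityContext.privileged: true`. The helper name `mig_monitor_conflicts` and the sample pod are hypothetical, and looking only at the `privileged` flag is an assumption about what the runtime actually tests.

```python
def mig_monitor_conflicts(pod):
    """Names of containers that set NVIDIA_MIG_MONITOR_DEVICES but are not privileged."""
    conflicts = []
    for container in pod.get("spec", {}).get("containers", []):
        env_names = {env.get("name") for env in container.get("env") or []}
        security = container.get("securityContext") or {}
        sets_var = "NVIDIA_MIG_MONITOR_DEVICES" in env_names
        # Assumption: only the explicit privileged flag is considered here.
        if sets_var and not security.get("privileged", False):
            conflicts.append(container["name"])
    return conflicts


# Hypothetical pod with one privileged and one unprivileged container.
sample_pod = {
    "spec": {
        "containers": [
            {
                "name": "monitor",
                "env": [{"name": "NVIDIA_MIG_MONITOR_DEVICES", "value": "all"}],
                "securityContext": {"privileged": True},
            },
            {
                "name": "workload",
                "env": [{"name": "NVIDIA_MIG_MONITOR_DEVICES", "value": "all"}],
            },
        ]
    }
}

print(mig_monitor_conflicts(sample_pod))
```

An empty result does not rule out other causes of Insufficient Permissions, such as an admission webhook rewriting the pod, as happened earlier in this thread.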

@twskipper are you using the GPU operator, and did it not set this env var?

yunfeng-scale avatar Apr 26 '24 03:04 yunfeng-scale