Successfully overwrote the MIG partition but cannot find the partition on the node
I changed the config from all-3g to all-7g:

$ kubectl label node rtx1 nvidia.com/mig.config=all-7g.40gb --overwrite
node/rtx1 labeled

When I run this command, it shows the MIG config was changed successfully:

$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-03ca4983-f693-39d2-d7e0-25090fe07b2f)
  MIG 7g.40gb Device 0: (UUID: MIG-a28fecf6-35ba-56a6-aab8-2be643b31249)
GPU 1: NVIDIA TITAN RTX (UUID: GPU-21058121-c040-c847-712d-da7a5cf48e4b)
GPU 2: NVIDIA TITAN RTX (UUID: GPU-edc7db6f-0fec-bc09-9cbe-5a8d2598a62e)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-1e09b62e-bae8-23dd-a55d-03b34ee00182)

But I cannot find the new partition on the node. Only the old 3g.20gb resource is listed, with a count of 0:

$ kubectl describe node rtx1 | grep nvidia.com/mig
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-7g.40gb
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
nvidia.com/mig-3g.20gb: 0
nvidia.com/mig-3g.20gb: 0
nvidia.com/mig-3g.20gb 0 0

$ kubectl get pod -n gpu-operator
NAME                                                          READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-gb6f5                                   1/1     Running            0                5m8s
gpu-operator-5798b5b564-zw5tg                                 1/1     Running            2 (139m ago)     179m
gpu-operator-node-feature-discovery-gc-86f6495b55-ntp72       1/1     Running            1 (154m ago)     179m
gpu-operator-node-feature-discovery-master-694467d5db-pddls   1/1     Running            2 (139m ago)     179m
gpu-operator-node-feature-discovery-worker-g89fd              1/1     Running            2 (139m ago)     179m
nvidia-container-toolkit-daemonset-96vnv                      1/1     Running            1 (154m ago)     167m
nvidia-cuda-validator-vcxkr                                   0/1     Completed          0                5m5s
nvidia-dcgm-exporter-sp7q9                                    1/1     Running            0                5m8s
nvidia-device-plugin-daemonset-pb98g                          0/1     CrashLoopBackOff   5 (112s ago)     5m8s
nvidia-mig-manager-kcp8m                                      1/1     Running            1 (154m ago)     177m
nvidia-operator-validator-m8xkj                               0/1     Init:3/4           1 (2m29s ago)    5m9s

Can you please provide the logs from the device plugin daemonset pod so we can triage why it is in the CrashLoopBackOff state?
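
In case it helps, here is a minimal sketch of one way to work out which log commands to run. It reads `kubectl get pod` output on stdin and prints a `kubectl logs` command for every pod whose STATUS is neither Running nor Completed. The function name `failing_pod_log_cmds` is hypothetical (it is not part of kubectl or the GPU operator), and the sketch assumes the default column layout where STATUS is the third field.

```shell
#!/bin/sh
# Hypothetical helper: print one `kubectl logs` command per unhealthy pod.
# stdin: output of `kubectl get pod -n <namespace>`
# $1:    the namespace to put in the printed commands
failing_pod_log_cmds() {
  ns="$1"
  # Skip the header row, then keep rows whose STATUS (field 3) is
  # anything other than Running or Completed.
  awk -v ns="$ns" '$1 != "NAME" && $3 != "Running" && $3 != "Completed" {
    printf "kubectl logs -n %s %s --all-containers\n", ns, $1
  }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pod -n gpu-operator | failing_pod_log_cmds gpu-operator
```

With the pod listing posted above, this would print commands for nvidia-device-plugin-daemonset-pb98g and nvidia-operator-validator-m8xkj. For a pod in CrashLoopBackOff, adding `--previous` to the printed command may also be useful, since it shows the logs of the last terminated container instance rather than the one currently restarting.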