H100 mig segmentation does not take effect
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): CentOS 7.9
- Kernel Version: 3.10.0-1160.el7
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s v1.27.5
- GPU Operator Version: v23.6.1
2. Issue or feature description
Installing MIG (single or mixed strategy) through the gpu-operator in an H100 HGX environment does not take effect. The mig-manager does not show any error logs; the corresponding logs are as follows:
[root@controller-node-1 ~]# kubectl -n gpu-operator logs nvidia-mig-manager-rn7xs
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
WITH_SHUTDOWN_HOST_GPU_CLIENTS=false
DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/run/nvidia/driver
Starting nvidia-mig-manager
W1225 07:53:18.074295 43392 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-12-25T07:53:18Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
time="2023-12-25T07:53:18Z" level=info msg="Updating to MIG config: all-1g.10gb"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=success'
Checking if the selected MIG config is currently applied or not
time="2023-12-25T07:53:24Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
Selected MIG mode settings from configuration currently applied
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/controller-node-1 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/controller-node-1 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-rhrl5 condition met
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Removing the cuda-validator pod
pod "nvidia-cuda-validator-ddcms" deleted
Removing the plugin-validator pod
No resources found
Applying the MIG mode change from the selected config to the node (and double checking it took effect)
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2023-12-25T07:53:30Z" level=debug msg="Parsing config file..."
time="2023-12-25T07:53:30Z" level=debug msg="Selecting specific MIG config..."
time="2023-12-25T07:53:30Z" level=debug msg="Running apply-start hook"
time="2023-12-25T07:53:30Z" level=debug msg="Checking current MIG mode..."
time="2023-12-25T07:53:31Z" level=debug msg="Walking MigConfig for (device-filter=[0x233110DE 0x232210DE 0x20B210DE 0x20B510DE 0x20F310DE 0x20F510DE], devices=all)"
time="2023-12-25T07:53:31Z" level=debug msg="Walking MigConfig for (device-filter=[0x20B010DE 0x20B110DE 0x20F110DE 0x20F610DE], devices=all)"
time="2023-12-25T07:53:31Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
time="2023-12-25T07:53:31Z" level=debug msg="Parsing config file..."
time="2023-12-25T07:53:31Z" level=debug msg="Selecting specific MIG config..."
time="2023-12-25T07:53:31Z" level=debug msg="Asserting MIG mode configuration..."
time="2023-12-25T07:53:32Z" level=debug msg="Walking MigConfig for (device-filter=[0x233110DE 0x232210DE 0x20B210DE 0x20B510DE 0x20F310DE 0x20F510DE], devices=all)"
time="2023-12-25T07:53:32Z" level=debug msg="Walking MigConfig for (device-filter=[0x20B010DE 0x20B110DE 0x20F110DE 0x20F610DE], devices=all)"
Selected MIG mode settings from configuration currently applied
Applying the selected MIG config to the node
time="2023-12-25T07:53:32Z" level=debug msg="Parsing config file..."
time="2023-12-25T07:53:32Z" level=debug msg="Selecting specific MIG config..."
time="2023-12-25T07:53:32Z" level=debug msg="Running apply-start hook"
time="2023-12-25T07:53:32Z" level=debug msg="Checking current MIG mode..."
time="2023-12-25T07:53:33Z" level=debug msg="Walking MigConfig for (device-filter=[0x233110DE 0x232210DE 0x20B210DE 0x20B510DE 0x20F310DE 0x20F510DE], devices=all)"
time="2023-12-25T07:53:33Z" level=debug msg="Walking MigConfig for (device-filter=[0x20B010DE 0x20B110DE 0x20F110DE 0x20F610DE], devices=all)"
time="2023-12-25T07:53:33Z" level=debug msg="Checking current MIG device configuration..."
time="2023-12-25T07:53:34Z" level=debug msg="Walking MigConfig for (device-filter=[0x233110DE 0x232210DE 0x20B210DE 0x20B510DE 0x20F310DE 0x20F510DE], devices=all)"
time="2023-12-25T07:53:34Z" level=debug msg="Walking MigConfig for (device-filter=[0x20B010DE 0x20B110DE 0x20F110DE 0x20F610DE], devices=all)"
time="2023-12-25T07:53:34Z" level=debug msg="Running pre-apply-config hook"
time="2023-12-25T07:53:34Z" level=debug msg="Applying MIG device configuration..."
time="2023-12-25T07:53:35Z" level=debug msg="Walking MigConfig for (device-filter=[0x233110DE 0x232210DE 0x20B210DE 0x20B510DE 0x20F310DE 0x20F510DE], devices=all)"
time="2023-12-25T07:53:35Z" level=debug msg="Walking MigConfig for (device-filter=[0x20B010DE 0x20B110DE 0x20F110DE 0x20F610DE], devices=all)"
time="2023-12-25T07:53:37Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-x5xtk" deleted
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/controller-node-1 labeled
Changing the 'nvidia.com/mig.config.state' node label to 'success'
node/controller-node-1 labeled
time="2023-12-25T07:53:41Z" level=info msg="Successfuly updated to MIG config: all-1g.10gb"
time="2023-12-25T07:53:41Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- [x] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- [ ] If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs: journalctl -u containerd > containerd.log
[root@controller-node-1 ~]# kubectl get po -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-rf6sh 0/1 CrashLoopBackOff 258 (2m41s ago) 23h
gpu-operator-744db8c9d7-wh529 1/1 Running 1 (23h ago) 26h
gpu-operator-node-feature-discovery-master-7b95bccc57-9pl6b 1/1 Running 2 (23h ago) 26h
gpu-operator-node-feature-discovery-worker-d7sgw 1/1 Running 1 (23h ago) 26h
helm-operation-install-gpu-operator-2ch5t9ntjl-vb4hn 0/1 Completed 0 26h
helm-operation-upgrade-gpu-operator-rxcqmb9q58-qmmb6 0/1 Completed 0 25h
nvidia-container-toolkit-daemonset-9bz58 1/1 Running 0 23h
nvidia-cuda-validator-h2wdz 0/1 Completed 0 23h
nvidia-dcgm-exporter-lx4w2 1/1 Running 0 23h
nvidia-device-plugin-daemonset-twb7g 1/1 Running 0 23h
nvidia-driver-daemonset-7shwz 1/1 Running 1 (23h ago) 23h
nvidia-mig-manager-rn7xs 1/1 Running 0 23h
nvidia-operator-validator-m6x5p 1/1 Running 0 23h
[root@controller-node-1 ~]# kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 26h
gpu-operator-node-feature-discovery-worker 1 1 1 1 1 <none> 26h
nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 26h
nvidia-dcgm-exporter 1 1 1 1 1 nvidia.com/gpu.deploy.dcgm-exporter=true 26h
nvidia-device-plugin-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.device-plugin=true 26h
nvidia-driver-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.driver=true 26h
nvidia-mig-manager 1 1 1 1 1 nvidia.com/gpu.deploy.mig-manager=true 26h
nvidia-operator-validator 1 1 1 1 1 nvidia.com/gpu.deploy.operator-validator=true 26h
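One thing that stands out in the output above is that gpu-feature-discovery is in CrashLoopBackOff (0/1 ready in both the pod and daemonset listings). Since GFD is the component that publishes the MIG-related node labels (with the device plugin advertising the resources), its logs are probably worth attaching as well. A hedged sketch, reusing the pod name from the listing above:

# Logs of the current and the previously crashed container
kubectl -n gpu-operator logs gpu-feature-discovery-rf6sh
kubectl -n gpu-operator logs gpu-feature-discovery-rf6sh --previous
# Events often show why the container keeps restarting
kubectl -n gpu-operator describe pod gpu-feature-discovery-rf6sh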
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
@wawa0210 the logs you provided suggest that mig-manager successfully applied the all-1g.10gb configuration. What does nvidia-smi show? And when you describe the node, do you see the expected number of nvidia.com/gpu resources being allocated?
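The nvidia-smi check can be run from the driver pod as in the sketch earlier in the thread; for the node resources, something along these lines should work (a sketch, with the node name taken from the listings above):

# Resources the kubelet is actually advertising for the node
kubectl describe node controller-node-1 | grep -A 15 Allocatable
kubectl get node controller-node-1 -o jsonpath='{.status.allocatable}'

With the single strategy the slices still surface as nvidia.com/gpu; with the mixed strategy they show up as nvidia.com/mig-1g.10gb. On an 80GB H100 the all-1g.10gb config is expected to yield 7 slices per GPU.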