gpu-operator
Driver won't fully start without manually draining node
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [x] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
CentOS 7, kernel 3.10.0-1160.31.1.el7.x86_64
When I redeploy or upgrade the GPU Operator on a node, the driver container doesn't seem to fully load unless I manually drain any pods that were previously using GPUs. My understanding is that the driver-manager container should take care of this, but it doesn't seem to be triggering a drain.
Before draining, I end up with containers that don't start because they can't find entries under /proc/driver/nvidia/. Using kubectl exec to get a shell in the driver container confirms that there are no entries in that directory. Draining the node resolves the problem.
This also happens when upgrading the driver, which is more problematic.
2. Steps to reproduce the issue
Ensure that at least one pod is using a GPU on the node
$ kubectl describe node dev-worker-gpu-0
...
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2795m (2%) 11300m (8%)
memory 9111656448 (0%) 16784385280 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
nvidia.com/mig-1g.10gb 1 1
nvidia.com/mig-2g.20gb 2 2
nvidia.com/mig-3g.40gb 0 0
Remove existing GPU operands from a node
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands=false
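To watch the operand pods terminate on this node, something like the following can be used (illustrative; assumes the operator is installed in the nvidia-gpu-operator namespace):
$ kubectl get pods -n nvidia-gpu-operator -o wide | grep dev-worker-gpu-0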
Wait for the GPU Operator pods to shut down, then SSH into the node and verify that the driver has been completely removed:
$ lsmod | grep nvidia
$ cat /sys/module/nvidia/refcnt
cat: /sys/module/nvidia/refcnt: No such file or directory
At this point, pods are still running on the node. Attempting to run nvidia-smi in one of them gives:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Re-enable GPU operands
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands-
Wait a bit for GPU operator pods to start up. They end up in this state:
gpu-feature-discovery-zk78b 0/1 Init:RunContainerError 5 5m18s
gpu-operator-84d9f557c8-zgdws 1/1 Running 0 23h
nvidia-container-toolkit-daemonset-wwk4z 1/1 Running 0 5m18s
nvidia-cuda-validator-96tx6 0/1 Completed 0 4h40m
nvidia-dcgm-exporter-5gmpz 0/1 Init:CrashLoopBackOff 4 5m18s
nvidia-device-plugin-daemonset-nfz68 0/1 Init:CrashLoopBackOff 4 5m18s
nvidia-driver-daemonset-vsv2n 1/1 Running 0 5m18s
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4kwcms 1/1 Running 0 23h
nvidia-gpu-operator-node-feature-discovery-worker-2mfh6 1/1 Running 0 23h
nvidia-gpu-operator-node-feature-discovery-worker-fkmvs 1/1 Running 0 23h
nvidia-gpu-operator-node-feature-discovery-worker-mdqjc 1/1 Running 0 23h
nvidia-operator-validator-8kpb8 0/1 Init:CrashLoopBackOff 4 5m11s
If we describe one of the pods that's in a crash loop, we see an error from the kubelet:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: mount error: stat failed: /run/nvidia/driver/proc/driver/nvidia/gpus/0000:0e:00.0: no such file or directory: unknown
We can indeed see that there are no entries in /proc/driver/nvidia in the driver pod:
$ kubectl exec nvidia-driver-daemonset-vsv2n -n nvidia-gpu-operator -- ls /proc/driver/nvidia/
$
The entries are also missing when we SSH into the host:
$ ls /run/nvidia/driver/proc/driver/nvidia/
$
The kernel modules do seem to be loaded:
$ lsmod | grep nvidia
nvidia_modeset 1137961 0
nvidia_uvm 1110679 0
nvidia 40740141 155 nvidia_modeset,nvidia_uvm
drm 456166 5 ast,ttm,drm_kms_helper,nvidia
Logs from k8s-driver-manager don't show any indication that it tried to drain the node:
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-6p5gg condition met
unbinding device 0000:0e:00.0
unbinding device 0000:13:00.0
unbinding device 0000:49:00.0
unbinding device 0000:4f:00.0
unbinding device 0000:86:00.0
unbinding device 0000:87:00.0
unbinding device 0000:88:00.0
unbinding device 0000:89:00.0
unbinding device 0000:8a:00.0
unbinding device 0000:8b:00.0
unbinding device 0000:94:00.0
unbinding device 0000:9a:00.0
unbinding device 0000:cc:00.0
unbinding device 0000:d1:00.0
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
If I manually drain the node, then uncordon it, everything starts working again.
kubectl drain dev-worker-gpu-0 --force --grace-period=0 --delete-emptydir-data --ignore-daemonsets
kubectl uncordon dev-worker-gpu-0
kubectl exec nvidia-driver-daemonset-vsv2n -n nvidia-gpu-operator -- ls /proc/driver/nvidia/
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
capabilities
gpus
params
patches
registry
suspend
suspend_depth
version
warnings
After I restart the container toolkit (needed because of #399), everything comes up cleanly.
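One way to trigger that restart is to delete the toolkit daemonset pod so it gets recreated (illustrative; pod name taken from the earlier listing):
$ kubectl delete pod nvidia-container-toolkit-daemonset-wwk4z -n nvidia-gpu-operator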
kubectl get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-79vdr 1/1 Running 0 83s
gpu-operator-84d9f557c8-zgdws 1/1 Running 0 24h
nvidia-container-toolkit-daemonset-49725 1/1 Running 0 86s
nvidia-cuda-validator-mgwtb 0/1 Completed 0 48s
nvidia-dcgm-exporter-wsv2d 1/1 Running 0 83s
nvidia-device-plugin-daemonset-4p84h 1/1 Running 0 83s
nvidia-device-plugin-validator-5bkd9 0/1 Completed 0 36s
nvidia-driver-daemonset-vsv2n 1/1 Running 0 17m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4kwcms 1/1 Running 0 24h
nvidia-gpu-operator-node-feature-discovery-worker-2mfh6 1/1 Running 0 24h
nvidia-gpu-operator-node-feature-discovery-worker-fkmvs 1/1 Running 0 24h
nvidia-gpu-operator-node-feature-discovery-worker-mdqjc 1/1 Running 0 24h
nvidia-mig-manager-jbzr8 1/1 Running 0 3m50s
nvidia-operator-validator-gzjtb 1/1 Running 0 76s
3. Information to attach (optional if deemed irrelevant)
driver-manager config:
driver:
  version: "515.48.07"
  rdma:
    enabled: false
  manager:
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
Kubelet logs: kubelet.log.gz
Kernel logs: kernel.log.gz
@neggert Thanks for the detailed report. Currently we only evict/drain the node when nvidia modules are loaded and they cannot be unloaded after evicting the GPU Operator operands. From the k8s-driver-manager logs, it looks like the nvidia modules were already unloaded, so it didn't attempt to drain the node. There may be a race condition where the container-toolkit sees the previous modules and the /run/nvidia/driver mount, concludes the driver is ready, and goes into the Running state while driver-manager is still unloading the driver and unmounting /run/nvidia/driver. Since the previous mount is now stale, all other pods started with the nvidia runtime fail.
This might not happen if nvidia.com/gpu.deploy.operands is not toggled: in that case the nvidia modules will be busy, so k8s-driver-manager will drain the node to properly evict all operands and ensure that they start only after the previous modules are unloaded. Can you try the same test with regular upgrades, but without toggling that node label?
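For reference, a regular upgrade without touching the node label could be triggered with a plain Helm upgrade, e.g. (illustrative; release name, namespace, and exact chart version string are assumptions):
$ helm upgrade gpu-operator nvidia/gpu-operator --version v1.11.1 -n nvidia-gpu-operator --reuse-values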
The same issue occurs both when downgrading from v1.11.1 to v1.10.1 and when re-upgrading from v1.10.1 to v1.11.1.
Driver manager logs after downgrading to v1.10.1:
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=paused-for-driver-upgrade'
nvidia driver module is already loaded with refcount 2
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-wvwhm condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Waiting for mig-manager to shutdown
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Successfully uninstalled nvidia driver components
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 unlabeled
node/dev-worker-gpu-0 unlabeled
When upgrading to v1.11.1:
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager=true'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-z9vb7 condition met
unbinding device 0000:0e:00.0
unbinding device 0000:13:00.0
unbinding device 0000:49:00.0
unbinding device 0000:4f:00.0
unbinding device 0000:86:00.0
unbinding device 0000:87:00.0
unbinding device 0000:88:00.0
unbinding device 0000:89:00.0
unbinding device 0000:8a:00.0
unbinding device 0000:8b:00.0
unbinding device 0000:94:00.0
unbinding device 0000:9a:00.0
unbinding device 0000:cc:00.0
unbinding device 0000:d1:00.0
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
node/dev-worker-gpu-0 not labeled
@neggert can you provide details on the GPU pods running on the node and the workloads they are running? If they are actively using the GPU, we expect k8s-driver-manager to drain the node, since it would fail when attempting to unload the driver modules. But it looks like we never get to that point in any of your log outputs: k8s-driver-manager is able to unload the driver just fine with your GPU pods still running.
So far I believe these have been Kubeflow notebook servers that have a GPU allocated but are sitting idle, i.e. not running any processes that are using the GPU.
I tried loading up PyTorch in a REPL and allocating a GPU tensor in one of the pods, then removing and re-adding the operands (with the tensor still allocated). In this case, the driver manager drains the node and the driver comes up properly.
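Roughly what I ran inside the pod, as a one-liner (illustrative):
$ python -c "import torch; x = torch.zeros(1, device='cuda'); input('tensor allocated, press Enter to exit')"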
Driver manager logs from that scenario:
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-5l9wp condition met
nvidia driver module is already loaded with refcount 130
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-4fpws condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_modeset 1137961 0
nvidia_uvm 1110679 2
nvidia 40740141 105 nvidia_modeset,nvidia_uvm
drm 456166 5 ast,ttm,drm_kms_helper,nvidia
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node dev-worker-gpu-0...
node/dev-worker-gpu-0 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-stxf5, kube-system/csi-beegfs-node-6t8nd, kube-system/kube-proxy-wm8kx, kube-system/nodelocaldns-bg82d, monitoring/prometheus-prometheus-node-exporter-qgflz, nvidia-gpu-operator/nvidia-driver-daemonset-ngpcl, nvidia-gpu-operator/nvidia-gpu-operator-node-feature-discovery-worker-m5tnw
evicting pod tests/write-to-nfs-27760065-52x9t
evicting pod tests/failed-job-heartbeat-27760065-sxjs5
evicting pod tests/hello-world-27760065-nsqkz
evicting pod nvidia-gpu-operator/nvidia-cuda-validator-f5hcr
evicting pod tests/full-gpu-in-pytorch-27760065-v7ttj
evicting pod tests/mig-in-pytorch-27760065-6qd4h
evicting pod nicholas-eggert/test-0
evicting pod tests/quota-manager-metadata-27760065-mrsjc
pod/nvidia-cuda-validator-f5hcr evicted
pod/quota-manager-metadata-27760065-mrsjc evicted
pod/full-gpu-in-pytorch-27760065-v7ttj evicted
pod/failed-job-heartbeat-27760065-sxjs5 evicted
pod/hello-world-27760065-nsqkz evicted
pod/write-to-nfs-27760065-52x9t evicted
pod/mig-in-pytorch-27760065-6qd4h evicted
pod/test-0 evicted
node/dev-worker-gpu-0 drained
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Successfully uninstalled nvidia driver components
unbinding device 0000:0e:00.0
unbinding device 0000:13:00.0
unbinding device 0000:49:00.0
unbinding device 0000:4f:00.0
unbinding device 0000:86:00.0
unbinding device 0000:87:00.0
unbinding device 0000:88:00.0
unbinding device 0000:89:00.0
unbinding device 0000:8a:00.0
unbinding device 0000:8b:00.0
unbinding device 0000:94:00.0
unbinding device 0000:9a:00.0
unbinding device 0000:cc:00.0
unbinding device 0000:d1:00.0
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Thanks for the additional info. If all GPU pods are idle at the time of the driver upgrade (i.e. there are no active GPU driver clients), then it makes sense why you are hitting this issue. As currently implemented, k8s-driver-manager will not drain the node if there are no active GPU clients, because it can successfully clean up the driver without doing so. The issue is that all pods with access to a GPU, regardless of whether they are active GPU clients or not, need to be restarted on a driver upgrade. If they are not restarted, they hold onto stale references to files from the previous driver installation (e.g. files under /run/nvidia/driver on the host). The issue you are seeing with /proc/driver/nvidia not being populated correctly is one we have seen before when a pod that has requested a GPU is not restarted across a driver upgrade.
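For reference, the pods that need restarting are the ones on the node that have requested any nvidia.com/ resource; one way to list them (illustrative sketch, requires jq):
$ kubectl get pods -A --field-selector spec.nodeName=dev-worker-gpu-0 -o json \
    | jq -r '.items[]
        | select([.spec.containers[].resources.limits // {} | keys[]] | any(startswith("nvidia.com/")))
        | "\(.metadata.namespace)/\(.metadata.name)"'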
We will aim to fix this issue in the next patch release.
Great. Thanks for looking into this. Please let me know if there's any other information I can provide.
@neggert we introduced a fix to k8s-driver-manager to address this issue. See https://gitlab.com/nvidia/cloud-native/k8s-driver-manager/-/merge_requests/37
We haven't published a release yet with these changes. But you can try out the latest build from top-of-tree by setting the following fields in ClusterPolicy:
driver:
  manager:
    repository: registry.gitlab.com/nvidia/cloud-native/k8s-driver-manager/staging
    image: k8s-driver-manager
    version: 49d67450-ubi8
    env:
      ...
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      ...
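One way to apply the image fields above (assuming the default ClusterPolicy resource name, cluster-policy) is a merge patch; the ENABLE_GPU_POD_EVICTION env entry can then be added with kubectl edit:
$ kubectl patch clusterpolicy cluster-policy --type merge -p \
    '{"spec":{"driver":{"manager":{"repository":"registry.gitlab.com/nvidia/cloud-native/k8s-driver-manager/staging","image":"k8s-driver-manager","version":"49d67450-ubi8"}}}}'
$ kubectl edit clusterpolicy cluster-policy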
This fix will be included in the upcoming GPU Operator v22.9.1 release.
Great, thanks for the quick fix. We'll give it a shot once the release drops.
Hi @neggert. We just released v22.9.1. This includes a fix to k8s-driver-manager to evict all GPU pods on the node when ENABLE_GPU_POD_EVICTION env is set to true (this is the default). Please give it a try and let us know if there are any issues.
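After upgrading, the driver-manager version in use can be confirmed on the driver daemonset, e.g. (illustrative; daemonset and namespace names taken from the listings above):
$ kubectl get ds nvidia-driver-daemonset -n nvidia-gpu-operator -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}'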
@cdesiniotis This issue does seem to be resolved, but I'm still running into #399