
Driver won't fully start without manually draining node

neggert opened this issue 3 years ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [x] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

CentOS 7 Kernel 3.10.0-1160.31.1.el7.x86_64

When I redeploy or upgrade the gpu operator on a node, the driver container doesn't seem to fully load unless I manually drain any pods that were previously using GPUs. My understanding is that the driver-manager container should take care of this, but it doesn't seem to be triggering a drain.

Before draining, I end up with containers that don't start because they can't find any entries under /proc/driver/nvidia/. Using kubectl exec to get a shell in the driver container confirms that there are no entries in that directory. Draining the node seems to resolve the problem.

This problem also happens when upgrading the driver, which is more problematic.

2. Steps to reproduce the issue

Ensure that at least one pod is using a GPU on the node

$ kubectl describe node dev-worker-gpu-0
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                Requests         Limits
  --------                --------         ------
  cpu                     2795m (2%)       11300m (8%)
  memory                  9111656448 (0%)  16784385280 (0%)
  ephemeral-storage       0 (0%)           0 (0%)
  hugepages-1Gi           0 (0%)           0 (0%)
  hugepages-2Mi           0 (0%)           0 (0%)
  nvidia.com/gpu          0                0
  nvidia.com/mig-1g.10gb  1                1
  nvidia.com/mig-2g.20gb  2                2
  nvidia.com/mig-3g.40gb  0                0
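For the first step, a minimal pod that holds one of the MIG slices shown above could look roughly like the following (the pod name, namespace, and image are placeholders for illustration; the resource name matches the MIG profile reported above):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mig-holder                  # hypothetical name
  namespace: tests                  # placeholder namespace
spec:
  nodeName: dev-worker-gpu-0
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.7.1-base-ubi8   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
EOF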

Remove existing GPU operands from a node

kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands=false

Wait for GPU operator pods to shut down. SSH into the node and verify that the driver has been completely removed

$ lsmod | grep nvidia
$ cat /sys/module/nvidia/refcnt
cat: /sys/module/nvidia/refcnt: No such file or directory
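(One way to watch the operand pods terminate before doing the host-side check above:)

kubectl get pods -n nvidia-gpu-operator -o wide --watch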

At this point, pods are still running on the node. Attempts to run nvidia-smi in one of the pods gives

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Re-enable GPU operands

kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands-

Wait a bit for GPU operator pods to start up. They end up in this state:

gpu-feature-discovery-zk78b                                       0/1     Init:RunContainerError   5          5m18s
gpu-operator-84d9f557c8-zgdws                                     1/1     Running                  0          23h
nvidia-container-toolkit-daemonset-wwk4z                          1/1     Running                  0          5m18s
nvidia-cuda-validator-96tx6                                       0/1     Completed                0          4h40m
nvidia-dcgm-exporter-5gmpz                                        0/1     Init:CrashLoopBackOff    4          5m18s
nvidia-device-plugin-daemonset-nfz68                              0/1     Init:CrashLoopBackOff    4          5m18s
nvidia-driver-daemonset-vsv2n                                     1/1     Running                  0          5m18s
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4kwcms   1/1     Running                  0          23h
nvidia-gpu-operator-node-feature-discovery-worker-2mfh6           1/1     Running                  0          23h
nvidia-gpu-operator-node-feature-discovery-worker-fkmvs           1/1     Running                  0          23h
nvidia-gpu-operator-node-feature-discovery-worker-mdqjc           1/1     Running                  0          23h
nvidia-operator-validator-8kpb8                                   0/1     Init:CrashLoopBackOff    4          5m11s

If we describe one of the pods that's in a crash loop, we see an error from the kubelet

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: mount error: stat failed: /run/nvidia/driver/proc/driver/nvidia/gpus/0000:0e:00.0: no such file or directory: unknown
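(The output above comes from describing one of the crash-looping pods, e.g.:)

kubectl describe pod nvidia-operator-validator-8kpb8 -n nvidia-gpu-operator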

We can indeed see that there are no entries in /proc/driver/nvidia in the driver pod

$ kubectl exec nvidia-driver-daemonset-vsv2n -n nvidia-gpu-operator -- ls /proc/driver/nvidia/
$

The entries are also missing when we SSH to the host

$ ls /run/nvidia/driver/proc/driver/nvidia/
$

The kernel module does seem to be loaded

$ lsmod | grep nvidia
nvidia_modeset       1137961  0
nvidia_uvm           1110679  0
nvidia              40740141  155 nvidia_modeset,nvidia_uvm
drm                   456166  5 ast,ttm,drm_kms_helper,nvidia

Logs from k8s-driver-manager don't show any indication that it tried to drain the node.

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-6p5gg condition met
unbinding device 0000:0e:00.0
unbinding device 0000:13:00.0
unbinding device 0000:49:00.0
unbinding device 0000:4f:00.0
unbinding device 0000:86:00.0
unbinding device 0000:87:00.0
unbinding device 0000:88:00.0
unbinding device 0000:89:00.0
unbinding device 0000:8a:00.0
unbinding device 0000:8b:00.0
unbinding device 0000:94:00.0
unbinding device 0000:9a:00.0
unbinding device 0000:cc:00.0
unbinding device 0000:d1:00.0
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled

If I manually drain the node, then uncordon it, everything starts working again.

kubectl drain dev-worker-gpu-0 --force --grace-period=0 --delete-emptydir-data --ignore-daemonsets
kubectl uncordon dev-worker-gpu-0
kubectl exec nvidia-driver-daemonset-vsv2n -n nvidia-gpu-operator -- ls /proc/driver/nvidia/
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
capabilities
gpus
params
patches
registry
suspend
suspend_depth
version
warnings

After I restart the container toolkit (needed because of #399)
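(One way to do that is to delete the toolkit pod from the earlier listing and let the DaemonSet recreate it:)

kubectl delete pod nvidia-container-toolkit-daemonset-wwk4z -n nvidia-gpu-operator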

kubectl get pods -n nvidia-gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-79vdr                                       1/1     Running     0          83s
gpu-operator-84d9f557c8-zgdws                                     1/1     Running     0          24h
nvidia-container-toolkit-daemonset-49725                          1/1     Running     0          86s
nvidia-cuda-validator-mgwtb                                       0/1     Completed   0          48s
nvidia-dcgm-exporter-wsv2d                                        1/1     Running     0          83s
nvidia-device-plugin-daemonset-4p84h                              1/1     Running     0          83s
nvidia-device-plugin-validator-5bkd9                              0/1     Completed   0          36s
nvidia-driver-daemonset-vsv2n                                     1/1     Running     0          17m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4kwcms   1/1     Running     0          24h
nvidia-gpu-operator-node-feature-discovery-worker-2mfh6           1/1     Running     0          24h
nvidia-gpu-operator-node-feature-discovery-worker-fkmvs           1/1     Running     0          24h
nvidia-gpu-operator-node-feature-discovery-worker-mdqjc           1/1     Running     0          24h
nvidia-mig-manager-jbzr8                                          1/1     Running     0          3m50s
nvidia-operator-validator-gzjtb                                   1/1     Running     0          76s

3. Information to attach (optional if deemed irrelevant)

driver-manager config:

driver:
  version: "515.48.07"
  rdma:
    enabled: false
  manager:
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"

Kubelet logs: kubelet.log.gz
Kernel logs: kernel.log.gz

neggert avatar Oct 11 '22 21:10 neggert

@neggert Thanks for the detailed report. Currently we only drain the node when there are nvidia modules loaded and they cannot be unloaded after evicting the GPU Operator operands. From the k8s-driver-manager logs it looks like the nvidia modules were already unloaded, so it didn't attempt to drain the node. There may be a race condition where container-toolkit thinks the driver is ready based on the previous modules and the /run/nvidia/driver mount and goes into the Running state, while driver-manager is still unloading the driver modules and unmounting /run/nvidia/driver. Since the previous mount is now stale, all other pods started with the nvidia runtime fail.

This might not happen if nvidia.com/gpu.deploy.operands is not toggled: in that case the nvidia modules will be busy, and k8s-driver-manager will drain the node to properly evict all operands and ensure they start only after the previous modules are unloaded. Can you try the same test with a regular upgrade, but without toggling that node label?
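(In shell terms, the decision described here is roughly the following; this is a simplified sketch, not the actual k8s-driver-manager code, and NODE_NAME is a placeholder:)

# Only drain when the module unload fails because the driver is still in use.
if rmmod nvidia_modeset nvidia_uvm nvidia 2>/dev/null; then
  echo "modules unloaded cleanly, no drain needed"
else
  echo "modules busy, evicting pods via node drain"
  kubectl drain "$NODE_NAME" --force --delete-emptydir-data --ignore-daemonsets
fi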

shivamerla avatar Oct 12 '22 05:10 shivamerla

The same issue occurs both when downgrading from v1.11.1 to v1.10.1 and when re-upgrading from v1.10.1 to v1.11.1.

Driver manager logs after downgrading to v1.10.1:

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=paused-for-driver-upgrade'
nvidia driver module is already loaded with refcount 2
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-wvwhm condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Waiting for mig-manager to shutdown
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Successfully uninstalled nvidia driver components
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 unlabeled
node/dev-worker-gpu-0 unlabeled

When upgrading to v1.11.1

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=paused-for-driver-upgrade'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager=true'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-z9vb7 condition met
unbinding device 0000:0e:00.0
unbinding device 0000:13:00.0
unbinding device 0000:49:00.0
unbinding device 0000:4f:00.0
unbinding device 0000:86:00.0
unbinding device 0000:87:00.0
unbinding device 0000:88:00.0
unbinding device 0000:89:00.0
unbinding device 0000:8a:00.0
unbinding device 0000:8b:00.0
unbinding device 0000:94:00.0
unbinding device 0000:9a:00.0
unbinding device 0000:cc:00.0
unbinding device 0000:d1:00.0
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
node/dev-worker-gpu-0 not labeled

neggert avatar Oct 12 '22 15:10 neggert

@neggert can you provide details on the GPU pods running on the node and the workloads they are running? If they are actively using the GPU, we expect k8s-driver-manager to drain the node, since unloading the driver modules would fail. But it looks like we never get to that point in any of your log outputs -- k8s-driver-manager is able to unload the driver just fine while your GPU pods are still running.
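(For reference, a quick way to list the pods scheduled on that node, which is roughly the information being asked for here:)

kubectl get pods --all-namespaces --field-selector spec.nodeName=dev-worker-gpu-0 -o wide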

cdesiniotis avatar Oct 12 '22 16:10 cdesiniotis

So far I believe these have been Kubeflow notebook servers that have a GPU allocated but are sitting idle, i.e. not running any processes that use the GPU.

I tried loading up Pytorch in a REPL and allocating a GPU tensor in one of the pods, then removing/re-adding the operands (with the tensor still allocated). In this case, it seems that the driver manager drains the node and the driver comes up properly.
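For reference, the test was roughly the following inside one of the GPU pods (a sketch, assuming PyTorch with CUDA support is installed there):

python3 -c "import torch; x = torch.ones(1, device='cuda'); input('GPU tensor allocated, press Enter to release it')"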

Driver manager logs from that scenario:

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-5l9wp condition met
nvidia driver module is already loaded with refcount 130
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled
Waiting for the operator-validator to shutdown
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-4fpws condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_modeset       1137961  0
nvidia_uvm           1110679  2
nvidia              40740141  105 nvidia_modeset,nvidia_uvm
drm                   456166  5 ast,ttm,drm_kms_helper,nvidia
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node dev-worker-gpu-0...
node/dev-worker-gpu-0 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-stxf5, kube-system/csi-beegfs-node-6t8nd, kube-system/kube-proxy-wm8kx, kube-system/nodelocaldns-bg82d, monitoring/prometheus-prometheus-node-exporter-qgflz, nvidia-gpu-operator/nvidia-driver-daemonset-ngpcl, nvidia-gpu-operator/nvidia-gpu-operator-node-feature-discovery-worker-m5tnw
evicting pod tests/write-to-nfs-27760065-52x9t
evicting pod tests/failed-job-heartbeat-27760065-sxjs5
evicting pod tests/hello-world-27760065-nsqkz
evicting pod nvidia-gpu-operator/nvidia-cuda-validator-f5hcr
evicting pod tests/full-gpu-in-pytorch-27760065-v7ttj
evicting pod tests/mig-in-pytorch-27760065-6qd4h
evicting pod nicholas-eggert/test-0
evicting pod tests/quota-manager-metadata-27760065-mrsjc
pod/nvidia-cuda-validator-f5hcr evicted
pod/quota-manager-metadata-27760065-mrsjc evicted
pod/full-gpu-in-pytorch-27760065-v7ttj evicted
pod/failed-job-heartbeat-27760065-sxjs5 evicted
pod/hello-world-27760065-nsqkz evicted
pod/write-to-nfs-27760065-52x9t evicted
pod/mig-in-pytorch-27760065-6qd4h evicted
pod/test-0 evicted
node/dev-worker-gpu-0 drained
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Successfully uninstalled nvidia driver components
unbinding device 0000:0e:00.0
unbinding device 0000:13:00.0
unbinding device 0000:49:00.0
unbinding device 0000:4f:00.0
unbinding device 0000:86:00.0
unbinding device 0000:87:00.0
unbinding device 0000:88:00.0
unbinding device 0000:89:00.0
unbinding device 0000:8a:00.0
unbinding device 0000:8b:00.0
unbinding device 0000:94:00.0
unbinding device 0000:9a:00.0
unbinding device 0000:cc:00.0
unbinding device 0000:d1:00.0
Uncordoning node dev-worker-gpu-0...
node/dev-worker-gpu-0 uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/dev-worker-gpu-0 labeled

neggert avatar Oct 12 '22 19:10 neggert

Thanks for the additional info. If all GPU pods are idle at the time of the driver upgrade (there are no active GPU driver clients), then it makes sense why you are hitting this issue. As currently implemented, k8s-driver-manager does not drain the node when there are no active GPU clients, because it can successfully clean up the driver without doing so. The problem is that all pods with access to a GPU, regardless of whether they are active GPU clients or not, need to be restarted on a driver upgrade. If they are not restarted, they hold onto stale references to files from the previous driver installation (e.g. files under /run/nvidia/driver on the host). The issue you are seeing with /proc/driver/nvidia not being populated correctly is one we have seen before when a pod which has requested a GPU is not restarted across a driver upgrade.

We will aim to fix this issue in the next patch release.

cdesiniotis avatar Oct 13 '22 20:10 cdesiniotis

Great. Thanks for looking into this. Please let me know if there's any other information I can provide.

neggert avatar Oct 13 '22 20:10 neggert

@neggert we introduced a fix to k8s-driver-manager to address this issue. See https://gitlab.com/nvidia/cloud-native/k8s-driver-manager/-/merge_requests/37

We haven't published a release with these changes yet, but you can try out the latest build from top-of-tree by setting the following fields in ClusterPolicy:

driver:
  manager:
    repository: registry.gitlab.com/nvidia/cloud-native/k8s-driver-manager/staging
    image: k8s-driver-manager
    version: 49d67450-ubi8
    env:
      # ...
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      # ...

This fix will be included in the upcoming GPU Operator v22.9.1 release.

cdesiniotis avatar Nov 10 '22 02:11 cdesiniotis

Great, thanks for the quick fix. We'll give it a shot once the release drops.

neggert avatar Nov 14 '22 19:11 neggert

Hi @neggert. We just released v22.9.1. This includes a fix to k8s-driver-manager to evict all GPU pods on the node when the ENABLE_GPU_POD_EVICTION env var is set to true (which is the default). Please give it a try and let us know if there are any issues.

cdesiniotis avatar Dec 14 '22 02:12 cdesiniotis

@cdesiniotis This issue does seem to be resolved, but I'm still running into #399

neggert avatar Jan 20 '23 20:01 neggert