Nic Eggert
Okay, any additional information you need that would help debug?
Actually I was mistaken. When configured as `7g.79.gb` it _does_ consistently show up everywhere as `7g.79gb`.
I wanted to check to see if the issue was a result of upgrading existing nodes in place. In an attempt to rule that out, I used the label to...
@shivamerla Any luck in reproducing this? Happy to provide more info if you let me know what you need.
Getting slightly different behavior when I went to reproduce this again. Now things are stuck in a crash loop: ``` gpu-feature-discovery-6rb27 0/1 Init:CrashLoopBackOff 11 37m gpu-operator-84d9f557c8-gp9p4 1/1 Running 0 37m...
The above problem seems to happen because the driver container is not populating `/run/nvidia/driver/proc/driver/nvidia`. Rebooting the node seems to resolve that. Not ideal, but I think it might be a...
@shivamerla That does solve the problem with the upgrade, but it means that we need to manually log into the node to restart containerd when adding, removing, or reconfiguring the...
The same issue occurs both when downgrading from v1.11.1 to v1.10.1 and when re-upgrading from v1.10.1 to v1.11.1. Driver manager logs after downgrading to v1.10.1: ``` Getting current value of...
So far I believe these have been Kubeflow notebook servers that have a GPU allocated but are sitting idle, i.e. not running any processes that are using the GPU. I...
Great. Thanks for looking into this. Please let me know if there's any other information I can provide.