No GPU node found in the cluster, so DaemonSets are not created
Goal: have a Docker container within a k8s cluster run a PyTorch script using the NVIDIA GPU on my local home computer.
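For reference, this is roughly the end state I'm after once the device plugin works: a pod that requests one GPU and runs a quick PyTorch CUDA check. The pod name and image tag below are illustrative examples only, not something from my current setup.

```bash
# Sketch of the target workload: request one GPU and verify PyTorch can see it.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: torch-gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: torch
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs torch-gpu-test   # should print "True" once the GPU is actually exposed
```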
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 6.2.0-36-generic
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Docker (using Docker-Desktop for Linux)
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s using Docker Desktop v1.28.2
- GPU Operator Version: v23.9.0
Hardware:
- Nvidia GeForce RTX 4090
- Intel 14900KF processor
2. Issue or feature description
The gpu-operator pod is not able to find the GPU; its log output is below (note the "Number of nodes with GPU label","NodeCount":0 line near the end):
{"level":"info","ts":1699757011.503861,"msg":"version: 762213f2"}
{"level":"info","ts":1699757011.5041952,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1699757011.5099807,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1699757011.5101135,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1699757011.6101875,"msg":"starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I1112 02:43:31.610309 1 leaderelection.go:245] attempting to acquire leader lease gpu-operator/53822513.nvidia.com...
I1112 02:43:31.614497 1 leaderelection.go:255] successfully acquired lease gpu-operator/53822513.nvidia.com
{"level":"info","ts":1699757011.6146836,"msg":"Starting EventSource","controller":"clusterpolicy-controller","source":"kind source: *v1.ClusterPolicy"}
{"level":"info","ts":1699757011.6147175,"msg":"Starting EventSource","controller":"clusterpolicy-controller","source":"kind source: *v1.Node"}
{"level":"info","ts":1699757011.6147218,"msg":"Starting EventSource","controller":"clusterpolicy-controller","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":1699757011.614736,"msg":"Starting Controller","controller":"clusterpolicy-controller"}
{"level":"info","ts":1699757011.61475,"msg":"Starting EventSource","controller":"upgrade-controller","source":"kind source: *v1.ClusterPolicy"}
{"level":"info","ts":1699757011.6147707,"msg":"Starting EventSource","controller":"upgrade-controller","source":"kind source: *v1.Node"}
{"level":"info","ts":1699757011.6147761,"msg":"Starting EventSource","controller":"upgrade-controller","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":1699757011.614778,"msg":"Starting Controller","controller":"upgrade-controller"}
{"level":"info","ts":1699757011.6147697,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1alpha1.NVIDIADriver"}
{"level":"info","ts":1699757011.6147943,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1.ClusterPolicy"}
{"level":"info","ts":1699757011.6148026,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1.Node"}
{"level":"info","ts":1699757011.6148062,"msg":"Starting EventSource","controller":"nvidia-driver-controller","source":"kind source: *v1.DaemonSet"}
{"level":"info","ts":1699757011.6148114,"msg":"Starting Controller","controller":"nvidia-driver-controller"}
{"level":"info","ts":1699757011.7161403,"msg":"Starting workers","controller":"clusterpolicy-controller","worker count":1}
{"level":"info","ts":1699757011.716681,"msg":"Starting workers","controller":"upgrade-controller","worker count":1}
{"level":"info","ts":1699757011.716697,"msg":"Starting workers","controller":"nvidia-driver-controller","worker count":1}
{"level":"info","ts":1699757012.7282248,"logger":"controllers.Upgrade","msg":"Reconciling Upgrade","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1699757012.7282693,"logger":"controllers.Upgrade","msg":"Using label selector","upgrade":{"name":"cluster-policy"},"key":"app","value":"nvidia-driver-daemonset"}
{"level":"info","ts":1699757012.728283,"logger":"controllers.Upgrade","msg":"Building state"}
{"level":"info","ts":1699757012.7292824,"logger":"controllers.ClusterPolicy","msg":"Kubernetes version detected","version":"v1.28.2"}
{"level":"info","ts":1699757012.729581,"logger":"controllers.ClusterPolicy","msg":"Operator metrics initialized."}
{"level":"info","ts":1699757012.7295918,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/pre-requisites"}
{"level":"info","ts":1699757012.7298443,"logger":"controllers.ClusterPolicy","msg":"PodSecurityPolicy API is not supported. Skipping..."}
{"level":"info","ts":1699757012.7298522,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-operator-metrics"}
{"level":"info","ts":1699757012.7301328,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-driver"}
{"level":"info","ts":1699757012.7314265,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-container-toolkit"}
{"level":"info","ts":1699757012.7318559,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-operator-validation"}
{"level":"info","ts":1699757012.7318525,"logger":"controllers.Upgrade","msg":"Propagate state to state manager","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1699757012.7318795,"logger":"controllers.Upgrade","msg":"State Manager, got state update"}
{"level":"info","ts":1699757012.7318833,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":0,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
{"level":"info","ts":1699757012.7318895,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":0,"currently unavailable nodes":0,"total number of nodes":0,"maximum nodes that can be unavailable":0}
{"level":"info","ts":1699757012.7318926,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1699757012.731894,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1699757012.7318954,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1699757012.7318974,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1699757012.7318988,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1699757012.7319002,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1699757012.731902,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1699757012.7319036,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1699757012.7319052,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1699757012.7319071,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1699757012.7319083,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"
{"level":"info","ts":1699757012.73191,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1699757012.7319114,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1699757012.7319129,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1699757012.7319145,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
{"level":"info","ts":1699757012.7331576,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-device-plugin"}
{"level":"info","ts":1699757012.7336576,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-dcgm"}
{"level":"info","ts":1699757012.733867,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-dcgm-exporter"}
{"level":"info","ts":1699757012.7341948,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/gpu-feature-discovery"}
{"level":"info","ts":1699757012.7345686,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-mig-manager"}
{"level":"info","ts":1699757012.7350636,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-node-status-exporter"}
{"level":"info","ts":1699757012.7354207,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-vgpu-manager"}
{"level":"info","ts":1699757012.735814,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-vgpu-device-manager"}
{"level":"info","ts":1699757012.7367134,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-sandbox-validation"}
{"level":"info","ts":1699757012.7371285,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-vfio-manager"}
{"level":"info","ts":1699757012.7375736,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-sandbox-device-plugin"}
{"level":"info","ts":1699757012.7379014,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-kata-manager"}
{"level":"info","ts":1699757012.7383142,"logger":"controllers.ClusterPolicy","msg":"Getting assets from: ","path:":"/opt/gpu-operator/state-cc-manager"}
{"level":"info","ts":1699757012.7393732,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"}
{"level":"info","ts":1699757012.739426,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"docker-desktop","GpuWorkloadConfig":"container"}
{"level":"info","ts":1699757012.7394345,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":0}
{"level":"info","ts":1699757012.739452,"logger":"controllers.ClusterPolicy","msg":"Unable to get runtime info from the cluster, defaulting to containerd"}
{"level":"info","ts":1699757012.73946,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"}
As a result:
- no nvidia-device-plugin DaemonSet is deployed
- the container is never able to find the GPU.
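To double-check the "Number of nodes with GPU label","NodeCount":0 line above, this is roughly how the labels on the single docker-desktop node can be inspected (the grep pattern is just an example):

```bash
# List the node's labels and look for anything NVIDIA/GPU related
# (GFD/NFD would normally add labels such as nvidia.com/gpu.present)
kubectl get node docker-desktop --show-labels | tr ',' '\n' | grep -i nvidia
# No output here would be consistent with the operator reporting NodeCount: 0
```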
3. Steps to reproduce the issue
- Install Docker Desktop for Linux
- Launch k8s by going to settings -> kubernetes -> enable
- Install the operator with Helm: helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false (full commands are sketched below)
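For completeness, the install commands were approximately the following (the Helm release name is generated, so yours will differ):

```bash
# Add the NVIDIA Helm repo and install the GPU Operator,
# skipping the driver container since the host driver is already installed
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false
```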
4. Information to attach (optional if deemed irrelevant)
A. The drivers for my GPU were pre-installed when I first installed Linux. I am able to run nerfstudio locally (no docker) and fully utilize my GPU and CUDA.
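I have not yet verified whether the GPU is reachable from inside Docker Desktop's VM at all (as opposed to the bare host); a quick check would be something like this, using an example CUDA base image:

```bash
# If this fails, the problem is at the Docker Desktop / --gpus layer,
# not in the GPU Operator itself
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```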
B. I noticed the NFD container is logging this message. Not sure if relevant:
I1112 02:43:31.491890 1 main.go:66] "-server is deprecated, will be removed in a future release along with the deprecated gRPC API"
I1112 02:43:31.491969 1 nfd-worker.go:219] "Node Feature Discovery Worker" version="v0.14.2" nodeName="docker-desktop" namespace="gpu-operator"
I1112 02:43:31.492181 1 nfd-worker.go:520] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
I1112 02:43:31.492386 1 nfd-worker.go:552] "configuration successfully updated" configuration={"Core":{"Klog":{},"LabelWhiteList":{},"NoPublish":false,"FeatureSources":["all"],"Sources":null,"LabelSources":["all"],"SleepInterval":{"Duration":60000000000}},"Sources":{"cpu":{"cpuid":{"attributeBlacklist":["BMI1","BMI2","CLMUL","CMOV","CX16","ERMS","F16C","HTT","LZCNT","MMX","MMXEXT","NX","POPCNT","RDRAND","RDSEED","RDTSCP","SGX","SGXLC","SSE","SSE2","SSE3","SSE4","SSE42","SSSE3","TDX_GUEST"]}},"custom":[],"fake":{"labels":{"fakefeature1":"true","fakefeature2":"true","fakefeature3":"true"},"flagFeatures":["flag_1","flag_2","flag_3"],"attributeFeatures":{"attr_1":"true","attr_2":"false","attr_3":"10"},"instanceFeatures":[{"attr_1":"true","attr_2":"false","attr_3":"10","attr_4":"foobar","name":"instance_1"},{"attr_1":"true","attr_2":"true","attr_3":"100","name":"instance_2"},{"name":"instance_3"}]},"kernel":{"KconfigFile":"","configOpts":["NO_HZ","NO_HZ_IDLE","NO_HZ_FULL","PREEMPT"]},"local":{},"pci":{"deviceClassWhitelist":["02","0200","0207","0300","0302"],"deviceLabelFields":["vendor"]},"usb":{"deviceClassWhitelist":["0e","ef","fe","ff"],"deviceLabelFields":["class","vendor","device"]}}}
I1112 02:43:31.492496 1 metrics.go:70] "metrics server starting" port=8081
E1112 02:43:31.495324 1 memory.go:91] "failed to detect NUMA nodes" err="failed to list numa nodes: open /host-sys/bus/node/devices: no such file or directory"
I1112 02:43:31.498944 1 nfd-worker.go:562] "starting feature discovery..."
I1112 02:43:31.499098 1 nfd-worker.go:577] "feature discovery completed"
I1112 02:43:31.512431 1 nfd-worker.go:698] "creating NodeFeature object" nodefeature=""
C. Based on the docs, I believe the NVIDIA Container Toolkit container should be launched automatically by the gpu-operator Helm chart, but I do not see it in my pod list.
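This is roughly how I checked (the expected DaemonSet names are an assumption based on what the chart usually creates):

```bash
# The operator should create container-toolkit and device-plugin DaemonSets;
# only the NFD worker shows up on my cluster (see the ds listing further below)
kubectl get ds -n gpu-operator
kubectl get ds -n gpu-operator | grep -Ei 'toolkit|device-plugin' || echo "not deployed"
```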
D. In the gpu-operator pod there is also a message saying "Unable to get runtime info from the cluster, defaulting to containerd". I'm not sure whether this is an issue, since I'm running k8s via Docker Desktop, which technically should be using the Docker engine.
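A way to see what the node itself reports (I would expect docker:// rather than containerd:// on Docker Desktop, which might be why the operator falls back to a default):

```bash
# Container runtime version the kubelet reports for the docker-desktop node
kubectl get node docker-desktop \
  -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}{"\n"}'
```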
E. One final thing: I'm not able to exec into the gpu-operator worker pod; it gives me this error:
OCI runtime exec failed: exec failed: unable to start container process: exec: "sh": executable file not found in $PATH: unknown
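My guess is that the operator image simply ships without a shell. If shell access is really needed, an ephemeral debug container might work instead; busybox and the --target container name here are assumptions on my part:

```bash
# Attach an ephemeral busybox container that shares the operator container's namespaces
kubectl debug -it -n gpu-operator POD_NAME --image=busybox --target=gpu-operator -- sh
```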
- [x] kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
NAME READY STATUS RESTARTS AGE
gpu-operator-1699757009-node-feature-discovery-gc-d94b5686vk2bd 1/1 Running 0 90m
gpu-operator-1699757009-node-feature-discovery-master-67bfjbcvm 1/1 Running 0 90m
gpu-operator-1699757009-node-feature-discovery-worker-fn6w5 1/1 Running 0 90m
gpu-operator-6f74bc4cd4-tstq5 1/1 Running 0 90m
- [x] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator-1699757009-node-feature-discovery-worker 1 1 1 1 1 <none> 90m
- [ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [x] Output from running nvidia-smi from the driver container:
kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 On | Off |
| 0% 45C P8 37W / 480W | 1198MiB / 24564MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1502 G /usr/lib/xorg/Xorg 497MiB |
| 0 N/A N/A 1723 G /usr/bin/gnome-shell 77MiB |
| 0 N/A N/A 3946 G ...ures=SpareRendererForSitePerProcess 229MiB |
| 0 N/A N/A 5388 G ...ures=SpareRendererForSitePerProcess 35MiB |
| 0 N/A N/A 5642 G ...sion,SpareRendererForSitePerProcess 93MiB |
| 0 N/A N/A 5898 G ...irefox/3358/usr/lib/firefox/firefox 193MiB |
| 0 N/A N/A 11580 G gnome-control-center 6MiB |
| 0 N/A N/A 36041 G ...,WinRetrieveSuggestionsOnlyOnDemand 31MiB |
+---------------------------------------------------------------------------------------+
- [x] containerd logs
journalctl -u containerd > containerd.log
NOTE: I sent an email with the full logs too
We have the same issue with the latest release (23.9.0), but it works with 23.6.1.
GPU: Tesla V100 16 GB, K8s: 1.27, OS: Ubuntu 22.04.3, Kernel: 5.15.0-88-generic
@benjaminprevost when you refer to the "same issue", are you also running docker-desktop?
Hi @joshpwrk, could you try https://microk8s.io/docs/addon-gpu instead of docker-desktop and let us know?
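Roughly, that would look like the following; the exact namespace the addon uses may differ by MicroK8s version, so this just greps across all namespaces:

```bash
# Install MicroK8s and enable the GPU addon, which deploys the GPU Operator for you
sudo snap install microk8s --classic
microk8s enable gpu
microk8s kubectl get pods -A | grep -i gpu
```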
@shivamerla / @cdesiniotis do we support Gaming cards with the Operator?
@ArangoGutierrez The GPUs that are officially supported can be found here
Looks like @joshpwrk's GPU card is not supported by the Operator.
> We have the same issue with the latest release (23.9.0), but it works with 23.6.1.
> GPU: Tesla V100 16 GB, K8s: 1.27, OS: Ubuntu 22.04.3, Kernel: 5.15.0-88-generic
@benjaminprevost Please file a new ticket for your use case. Could you tell us more about the Kubernetes solution you are using? Is it virtualized or bare metal? (In the new issue, not here, just to be clear.)