Bottlerocket 1.25 is preventing the NVIDIA driver from starting
Description
We have been using Bottlerocket 1.25 for the past 3 days. Since the upgrade, some of our GPU nodes (g5.xlarge) are failing to initialize the NVIDIA driver on Kubernetes 1.30. The affected image is bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.25.0-388e1050.
Issue
When the driver fails to initialize, the node does not advertise its nvidia.com/gpu capacity, causing the corresponding pods to remain in a Pending state indefinitely. This prevents any pods scheduled on the affected node from starting.
The kubelet log on the admin container of an affected node contains the following error:
Oct 17 12:05:33 ip-xx-xx-xx-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816 1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Additional Information
- We are using Karpenter to provision our nodes, but I don't think the issue is related to it. The NodeClaim remains in an Unknown state indefinitely and shows the following condition message:
Message: Resource "nvidia.com/gpu" was requested but not registered
- The issue occurs on approximately 1 out of every 6 nodes, suggesting a potential race condition during node startup that interferes with NVIDIA driver initialization.
- This problem was not observed with Bottlerocket 1.24.
Expected Behavior
The NVIDIA driver should initialize correctly on all GPU nodes, and the node should advertise its GPU capacity to ensure proper scheduling.
Steps to Reproduce
- Deploy Bottlerocket 1.25 on 6 (or more) g5.xlarge instances in a Kubernetes 1.30 cluster.
- Observe the node initialization process and monitor whether the NVIDIA driver is initialized correctly.
- Check the node’s GPU capacity and NodeClaim status.
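One way to check the node's GPU capacity and the NodeClaim status from the cluster side (just a sketch; <node-name> and <nodeclaim-name> are placeholders):
kubectl get node <node-name> -o jsonpath='{.status.capacity}'
kubectl describe nodeclaim <nodeclaim-name>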
Thank you for the report. I will try your reproduction recipe and let you know what I see.
My first attempt to reproduce this failed (all instances had a functioning NVIDIA driver), which doesn't say very much if (as seems very likely) this is a race condition between driver load and kubelet start. I will keep looking. One mildly troubling indication is that I do not see any systemd scheduling that ensures that driver load occurs before kubelet start.
If you would like to dig deeper yourself, you might look at this on a failing node:
bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 18 17:40:27 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 18 17:40:28 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:28 [INFO] Updated modules dependencies
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:29 [INFO] Loaded kernel modules
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal systemd[1]: Finished Load Tesla kernel modules.
This is the systemd service that loads the NVIDIA driver kmod on a G5 (or similar) instance. If this happens before kubelet start on successful boots and after kubelet start on failed boots, we will have found the race condition you suspected.
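One quick way to compare the two on a given node (a sketch, assuming journal access from the admin container's root shell) is to interleave both units for the current boot and check which one starts first:
bash-5.1# journalctl -b -u load-tesla-kernel-modules -u kubelet -o short-precise | head -n 20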
I used systemd-analyze dump to look at the constraints it used during boot on a g5.xlarge with the 1.25.0 AMI, and I believe we can rule out a race between kmod and kubelet:
bash-5.1# systemd-analyze dump kubelet.service
...
After: configured.target (origin-file)
...
bash-5.1# systemd-analyze dump configured.target
...
After: preconfigured.target (origin-file origin-default)
...
bash-5.1# systemd-analyze dump preconfigured.target
...
After: load-tesla-kernel-modules.service (origin-default)
This should ensure that the driver is present before kubelet starts, so at least for now I would rule out a race between these two systemd services.
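As another sanity check, systemd-analyze critical-chain prints the time-critical chain of units leading up to kubelet start, with per-unit activation and run times, which makes the relative timing easy to eyeball:
bash-5.1# systemd-analyze critical-chain kubelet.service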
Any configuration details you could share would be helpful in reproducing this problem. It would be interesting to see the device plugin logging, since that's the source of the kubelet error message you provide. This is what I see on my instances:
bash-5.1# journalctl | grep nvidia-device
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372837 1558 main.go:199] Starting FS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372929 1558 main.go:206] Starting OS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373179 1558 main.go:221] Starting Plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373193 1558 main.go:278] Loading configuration.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386371 1558 main.go:303] Updating config with default resource matching patterns.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386452 1558 main.go:314]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: Running with config:
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "version": "v1",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "flags": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "migStrategy": "none",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "failOnInitError": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "mpsRoot": "",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "nvidiaDriverRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "nvidiaDevRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "gdsEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "mofedEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "useNodeFeatureAPI": null,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "deviceDiscoveryStrategy": "auto",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "plugin": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "passDeviceSpecs": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "deviceListStrategy": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "volume-mounts"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: ],
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "deviceIDStrategy": "index",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "containerDriverRoot": "/driver-root"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "resources": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "gpus": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "pattern": "*",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "name": "nvidia.com/gpu"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: ]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "sharing": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: "timeSlicing": {}
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386468 1558 main.go:317] Retrieving plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.541469 1558 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.546032 1558 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.553200 1558 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
There is one possibly-relevant change in 1.25.0. We added support for GPU time-slicing (as seen in https://github.com/bottlerocket-os/bottlerocket/pull/4230). Normally I would expect the nvidia-device-plugin process to run until node shutdown. If the plugin crashes or shuts down for any reason, kubelet will see an unexpected eof on the plugin socket and print that ListAndWatch ended unexpectedly error message. If you can, please share the nvidia-device-plugin logging from one of your failed instances.
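If it helps, a couple of generic checks on an affected node (from the admin container's root shell via sheltie; these are ordinary process and journal queries, nothing Bottlerocket-specific) will show whether the plugin process exited or restarted:
bash-5.1# pgrep -a nvidia-device-plugin
bash-5.1# journalctl -t nvidia-device-plugin -o short-precise | tail -n 50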
Hello again, and thanks for investigating this! I've retrieved some logs from a faulty node, here they are:
bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 17 12:05:19 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Updated modules dependencies
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Loaded kernel modules
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal systemd[1]: Finished Load Tesla kernel modules.
bash-5.1# journalctl -u kubelet
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal systemd[1]: Starting Kubelet...
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1639]: unpacking localhost/kubernetes/pause:0.1.0 (sha256:e287062022dab2bcd00be3aa5b65a10318c93c3426cc7b72126a2108d28c47aa)...done
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1648]: io.cri-containerd.image=managed,io.cri-containerd.pinned=pinned
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --register-with-taints has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.004911 1655 server.go:205] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012604 1655 server.go:484] "Kubelet version" kubeletVersion="v1.30.1-eks-e564799"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012638 1655 server.go:486] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.030358 1655 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062024 1655 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062066 1655 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"NodeName":"ip-xx-xx-x-xx.ec2.internal","RuntimeCgroupsName":"/runtime.slice/containerd.service","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"systemd","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":true,"KubeReservedCgroupName":"/runtime","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":{"cpu":"80m","ephemeral-storage":"1Gi","memory":"893Mi"},"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.1},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.15},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":1048576,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063928 1655 topology_manager.go:138] "Creating topology manager with none policy"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063947 1655 container_manager_linux.go:301] "Creating device plugin manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.066091 1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069168 1655 kubelet.go:400] "Attempting to sync node with API server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069195 1655 kubelet.go:301] "Adding static pod path" path="/etc/kubernetes/static-pods/"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069248 1655 kubelet.go:312] "Adding apiserver pod source"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.070036 1655 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.073362 1655 kuberuntime_manager.go:261] "Container runtime initialized" containerRuntime="containerd" version="1.7.22+bottlerocket" apiVersion="v1"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.079193 1655 kubelet.go:815] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: W1017 12:05:33.081030 1655 probe.go:272] Flexvolume plugin directory at /var/lib/kubelet/plugins/volume/exec does not exist. Recreating.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.082417 1655 server.go:1264] "Started kubelet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal systemd[1]: Started Kubelet.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087003 1655 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087388 1655 server.go:227] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087484 1655 server.go:163] "Starting to listen" address="0.0.0.0" port=10250
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.088330 1655 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.089641 1655 server.go:455] "Adding debug handlers to kubelet server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.090462 1655 kubelet.go:1467] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090469 1655 volume_manager.go:291] "Starting Kubelet Volume Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090500 1655 desired_state_of_world_populator.go:149] "Desired state populator starts to run"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090824 1655 reconciler.go:26] "Reconciler: start to sync state"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093738 1655 factory.go:221] Registration of the containerd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093765 1655 factory.go:221] Registration of the systemd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093862 1655 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110556 1655 cpu_manager.go:214] "Starting CPU manager" policy="none"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110585 1655 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110606 1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.136374 1655 policy_none.go:49] "None policy: Start"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137607 1655 memory_manager.go:170] "Starting memorymanager" policy="None"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137642 1655 state_mem.go:35] "Initializing new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.138216 1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144592 1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144778 1655 status_manager.go:217] "Starting to sync pod status with apiserver"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144850 1655 kubelet.go:2337] "Starting kubelet main sync loop"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.144903 1655 kubelet.go:2361] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.191233 1655 kubelet_node_status.go:73] "Attempting to register node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.246414 1655 kubelet.go:2361] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250055 1655 manager.go:479] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250238 1655 container_log_manager.go:186] "Initializing container log rotate workers" workers=1 monitorPeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347 1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.254400 1655 eviction_manager.go:282] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-xx-xx-x-xx.ec2.internal\" not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.309607 1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.325133 1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816 1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.378294 1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Canceled desc = grpc: the client connection is closing" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.734727 1655 kubelet_node_status.go:76] "Successfully registered node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:34 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:34.070697 1655 apiserver.go:52] "Watching apiserver"
bash-5.1# journalctl | grep nvidia-device
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.230706 1667 main.go:199] Starting FS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.231641 1667 main.go:206] Starting OS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232173 1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232186 1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236537 1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236586 1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236597 1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.299855 1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.305466 1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312453 1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312491 1667 main.go:246] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312501 1667 main.go:353] Stopping plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312507 1667 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313088 1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313097 1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313542 1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313587 1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313597 1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.323855 1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.324374 1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.326374 1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I'm keeping that node around in case you have other ideas of logs that could help. Thanks again!
@larvacea, we also added nvidia-persistenced (https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/122), and time-slicing wouldn't play a role here unless the users enabled it, which it seems they didn't. I wonder if nvidia-persistenced is interfering with the device plugin?
@nikoul, in the failing instance, could you please check the journal for any other errors:
journalctl -p4
@arnaldo2792 Of course, here is the error log from journalctl -p4:
bash-5.1# journalctl -p4
Oct 17 12:05:14 localhost kernel: #3
Oct 17 12:05:14 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Oct 17 12:05:14 localhost kernel: GPT:2097151 != 125829119
Oct 17 12:05:14 localhost kernel: GPT:Alternate GPT header not at the end of the disk.
Oct 17 12:05:14 localhost kernel: GPT:2097151 != 125829119
Oct 17 12:05:14 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Oct 17 12:05:14 localhost kernel: Module i8042 is blacklisted
Oct 17 12:05:14 localhost kernel: Module i8042 is blacklisted
Oct 17 12:05:15 localhost kernel: XFS (nvme1n1p1): EXPERIMENTAL Large extent counts feature in use. Use at your own risk!
Oct 17 12:05:17 localhost systemd-networkd[1192]: eth0: found matching network '/etc/systemd/network/10-eth0.network', based on potentially unpredictable interface name.
Oct 17 12:05:17 localhost systemd-networkd[1192]: eth0: found matching network '/etc/systemd/network/10-eth0.network', based on potentially unpredictable interface name.
Oct 17 12:05:19 localhost systemd-networkd[1192]: eth0: found matching network '/etc/systemd/network/10-eth0.network', based on potentially unpredictable interface name.
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: nvidia: loading out-of-tree module taints kernel.
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: nvidia: module license 'NVIDIA' taints kernel.
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: Disabling lock debugging due to kernel taint
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel:
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.06 Wed Jun 26 06:46:07 UTC 2024
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Oct 17 12:05:50 ip-xx-xx-x-xx.ec2.internal systemd-run[12]: Failed to create bus connection: No such file or directory
Oct 17 12:06:08 ip-xx-xx-x-xx.ec2.internal kernel: process 'linkerd2-proxy-' launched '/usr/lib/linkerd/linkerd2-proxy' with NULL argv: empty string added
I thought I had a Karpenter issue following upgrade to v1 but I might be having exactly the same issue: https://github.com/aws/karpenter-provider-aws/issues/7046#issuecomment-2435373013
Oh, thanks for sharing. Yeah, if you don't have the issue with 1.24, then it's very likely to be that same issue.
From the logs, this seems like the problematic sequence:
# plugin manager starts
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347 1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"
# k8s device plugin registers with kubelet
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.305466 1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.309607 1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312453 1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
# kubelet.sock re-created? plugin restarts
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312491 1667 main.go:246] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312501 1667 main.go:353] Stopping plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312507 1667 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
...
# k8s device plugin registers with kubelet again
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.324374 1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.325133 1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.326374 1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
# kubelet errors out
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816 1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.378294 1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Canceled desc = grpc: the client connection is closing" resource="nvidia.com/gpu"
For affected nodes, journalctl -t kubelet -t nvidia-device-plugin -o short-precise should interleave logs from the components involved, with higher-precision timestamps in the left column.
Specifically, this inotify event firing appears to cause the trouble:
inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
Also of interest, kubelet's plugin manager starts after the FS watcher starts:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.230706 1667 main.go:199] Starting FS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347 1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"
Since the FS watcher watches the directory, it will see the initial kubelet.sock creation and treat it as a kubelet restart.
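As a concrete filter (based only on the log lines quoted in this thread), something like this isolates the ordering of the watcher start, the plugin manager start, and the inotify-triggered restart on an affected node:
bash-5.1# journalctl -t kubelet -t nvidia-device-plugin -o short-precise | grep -E 'Starting FS watcher|Kubelet Plugin Manager|kubelet.sock created|ListAndWatch ended'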
Thanks for the notes @bcressey. I'm working on changes to ensure nvidia-device-plugin starts after the kubelet socket becomes available.
I just had the same issue on Bottlerocket OS 1.24.1 (aws-k8s-1.31-nvidia); it happens far less often than on 1.25, but it did just happen.
I never had this issue until recently (≃ 2024-10, which matches the time I switched to EKS 1.31).
Karpenter tells me Resource "nvidia.com/gpu" was requested but not registered, and the node has capacity nvidia.com/gpu: 0.
Sadly this happened in prod where SSH access and the admin container are disabled, not sure how I can get logs out.
My understanding of the bug is that there's nothing particular about 1.25.0 that introduces this race condition -- I suspect it has been possible since the device plugin was introduced, though perhaps the timing makes it somewhat more likely now.
As an update, I'm in the process of testing my changes right now. Hoping to have a PR out with the changes soon.
Sorry for the delay here, the race condition was so rare in my testing environment that it was challenging to prove that we actually resolved the issue.
After more thorough investigation, there are several unfortunate behaviors that contribute to this bug.
In this case:
- The kubelet systemd service is Type=notify, which means it signals to systemd when it has finished starting (here in the kubelet sources).
- The device-plugin socket is started in a goroutine in kubelet. There's no synchronization between when the socket is finally created and the systemd notification mentioned above, so kubelet can mark itself as "started" before the socket is created.
- The nvidia-k8s-device-plugin is written to restart its k8s plugins if it notices a "create" event for the kubelet socket.
- If the nvidia-k8s-device-plugin has started its device plugin gRPC server and kubelet has initially registered it, the nvidia-k8s-device-plugin's restart can cause an error state in kubelet, which appears to attempt to communicate with the "old" socket and then remove the new registration.
In practice, we can always avoid the bug by refraining from starting nvidia-k8s-device-plugin until we know that kubelet.sock is in place. That's the approach taken in https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/228.
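For anyone who wants to experiment before that change ships, the general shape of the mitigation is simply to block the plugin's startup until kubelet's registration socket exists. A minimal sketch of the idea (not the actual change in that PR; the plugin binary path is a placeholder):
# Wait for kubelet's device-plugin registration socket before launching
# the NVIDIA device plugin.
until [ -S /var/lib/kubelet/device-plugins/kubelet.sock ]; do
    sleep 1
done
exec /path/to/nvidia-device-plugin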
The issues caused by the device plugin restarts definitely require further investigation though.
https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/228 is merged, which should resolve this in an upcoming Bottlerocket release!
@bcressey has done some great work looking into why nvidia-k8s-device-plugin restarts can sometimes lead to GPU resources failing to be exposed. I'll summarize what he's mentioned to me.
It seems that you can reliably "break" the exposure of GPUs via the device plugin by doing something like this:
for i in {1..100} ; do
  echo $i
  kill -sHUP $(pgrep -f nvidia-device-plugin)
  sleep 0.2
  (journalctl -t kubelet -xe | tail -n1 | grep 'client connection') && break
done
This is because SIGHUP also results in the device plugin doing a restart, which is what could potentially lead to the broken state we've been discussing.
For some background, you can find more on how device plugins are designed here.
Ok, so how do restarts work?
- The kubelet plugin watcher claims that it's used for device plugins, and that removing the socket will trigger a deregister action. This doesn't appear to be implemented for device plugins.
- What kubelet actually does after connecting to a plugin socket is just wait for ListAndWatch responses and bail out if there's an error.
- Per this comment on a grpc-go issue, it seems there isn't an API for a gRPC server to close a connection, which means the device plugin doesn't have a good way to signal to kubelet that it would like to end the session.
So what is likely happening here is that kubelet tries to continue the "old" session after nvidia-k8s-device-plugin restarts and forms a new connection. When the "old" session eventually fails its ListAndWatch call, we speculate that it de-registers the "new" registration from the device plugin.
I'll keep this issue open while:
- We work to include my aforementioned patch in a release of Bottlerocket
- We determine the right set of issues to file with upstream projects regarding the retry behavior I mentioned.
Thank you very much for the investigation and the upcoming fix!
I got this same error too when I describe the nodeclaim. I get this every time when time-slicing is enabled: the nodeclaim state is False but the node is in a Ready state.
Also, while the node is in a Ready state, a pod which requested "nvidia.com/gpu: 1" still remains in a Pending state and is not scheduled. Once I delete the pod and quickly re-apply it, the pod gets scheduled on that node and works, but the nodeclaim is still in a False state with the same error: Resource "nvidia.com/gpu" was requested but not registered.
And if I delete the pod and wait for Karpenter to consolidate that node, it won't happen, and I need to manually delete the nodeclaim to clean up that node.
If time-slicing is not enabled, I haven't seen this issue.
Bottlerocket version: Bottlerocket OS 1.26.1 (aws-k8s-1.30-nvidia)
Thanks for the new Bottlerocket version.
I tried again with the latest Bottlerocket, Bottlerocket OS 1.26.2 (aws-k8s-1.30-nvidia), and the behavior persists.
I am using Karpenter to launch nodes based on the resource requirement below in a pod, which triggers Karpenter to launch a GPU node. The node becomes ready quickly, but the nodeclaim always stays in a False state with the same error, and the pod remains in a Pending state and never gets scheduled.
Note: We have time-slicing configured.
NodeClaim Error:
Resource "nvidia.com/gpu" was requested but not registered
Resource:
resources:
limits:
# nvidia.com/gpu.shared: 1
nvidia.com/gpu: 1
memory: "4Gi"
cpu: "1"
My assumption is that Karpenter is working as expected: it launches the instance for the resource nvidia.com/gpu, which is what was requested, but due to the time-slicing configuration, the node now only advertises nvidia.com/gpu.shared while the pod requested nvidia.com/gpu: 1, which can be validated in the pod event logs. In the events, Karpenter nominated the newly created node while the default scheduler failed to schedule the pod, since it is looking for nvidia.com/gpu capacity and that capacity is no longer advertised or available.
Events:
Warning FailedScheduling 13s default-scheduler 0/29 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had untolerated taint {workload/xxxxx: }, 1 node(s) had untolerated taint {workload/yyyyy: }, 18 node(s) had untolerated taint {workload/default: }, 2 node(s) had untolerated taint {workload/zzzzz: }, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {workload/aaaaaa: }. preemption: 0/29 nodes are available: 1 No preemption victims found for incoming pod, 28 Preemption is not helpful for scheduling.
Normal Nominated 12s karpenter Pod should schedule on: nodeclaim/gpu-node-pool-mbxrg, node/ip-172-xx-xx-xxx.us-west-2.compute.internal
New pods with the resource requirements below schedule just fine on the same node, since the shared capacity is available and advertised.
Resource:
resources:
limits:
nvidia.com/gpu.shared: 1
memory: "4Gi"
cpu: "1"
I had to delete the original pod and re-create it with the shared requirement, and it works fine after that. The issue is that the nodeclaim is in a False state and will not be reclaimed by Karpenter when there is no workload running.
EDIT: I was able to fix my issue by not renaming the resource from nvidia.com/gpu to nvidia.com/gpu.shared. In the Bottlerocket configuration, the rename setting is true by default.
Ref: https://bottlerocket.dev/en/os/1.25.x/api/settings/kubelet-device-plugin/#nvidia-time-slicing-rename-by-default
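If anyone wants to confirm which resource name a node is actually advertising (nvidia.com/gpu vs. nvidia.com/gpu.shared), the node's allocatable map shows it directly; <node-name> is a placeholder:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'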
I got a nodeclaim stuck in an Unknown state on v1.26.2 (aws-k8s-1.31-nvidia), again causing pods to not get scheduled.
I do not use time-slicing, and I don't have any special configuration regarding the GPUs.
Karpenter and event log:
{"level":"INFO","time":"2024-11-15T09:30:18.466Z","logger":"controller","message":"deleted nodeclaim","commit":"6174c75","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"mynodepool-btjcg"},"namespace":"","name":"mynodepool-btjcg","reconcileID":"5f6ddc41-3b7a-4051-be4b-fdfc8cc3fe3e","Node":{"name":"i-0bb15c620a9cf7aba.eu-west-1.compute.internal"},"provider-id":"aws:///eu-west-1a/i-0bb15c620a9cf7aba"}
43x Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-034e7963-5809-4383-a466-a6399fb2e763-8477f46859-z5h8j.180819b84ef9dc25)
FailedScheduling: Failed to schedule pod, would schedule against uninitialized nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-ac1e4646-ac55-4d4b-a254-4f438f9ab53e-5fdfb9f9b4-4d65w.1807dbd7b70f6f8c)
17x Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-ac1e4646-ac55-4d4b-a254-4f438f9ab53e-5fdfb9f9b4-4d65w.1807dbcbc6fa446d)
DisruptionBlocked: Cannot disrupt NodeClaim: state node isn't initialized (mynodepool-btjcg.1807cda0f184a647)
Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-b9a89a1f-8621-4b1c-beca-0bdf13f28f41-6f6bc894bc-nxk4z.1807cd9fd93d0299)
Registered: Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered (mynodepool-btjcg.1807cd88e828352a)
{"level":"INFO","time":"2024-11-14T10:02:49.560Z","logger":"controller","message":"registered nodeclaim","commit":"6174c75","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"mynodepool-btjcg"},"namespace":"","name":"mynodepool-btjcg","reconcileID":"ac2e9b5d-a471-4db3-9cfe-a04b47e89515","provider-id":"aws:///eu-west-1a/i-0bb15c620a9cf7aba","Node":{"name":"i-0bb15c620a9cf7aba.eu-west-1.compute.internal"}}
DisruptionBlocked: Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim (mynodepool-btjcg.1807cd84c0cc2da1)
Launched: Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched (mynodepool-btjcg.1807cd839b9f2162)
{"level":"INFO","time":"2024-11-14T10:02:26.850Z","logger":"controller","message":"launched nodeclaim","commit":"6174c75","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"mynodepool-btjcg"},"namespace":"","name":"mynodepool-btjcg","reconcileID":"70de79d5-e652-4920-a3ec-aea20d0baf03","provider-id":"aws:///eu-west-1a/i-0bb15c620a9cf7aba","instance-type":"g5.4xlarge","zone":"eu-west-1a","capacity-type":"on-demand","allocatable":{"cpu":"15890m","ephemeral-storage":"538926258176","memory":"57691Mi","nvidia.com/gpu":"1","pods":"234","vpc.amazonaws.com/pod-eni":"34"}}
Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg (myapp-b9a89a1f-8621-4b1c-beca-0bdf13f28f41-6f6bc894bc-nxk4z.1807cd82fb7efa63)
{"level":"INFO","time":"2024-11-14T10:02:24.196Z","logger":"controller","message":"created nodeclaim","commit":"6174c75","controller":"provisioner","namespace":"","name":"","reconcileID":"862b75c6-8352-4a5f-8222-0069cc615e4a","NodePool":{"name":"mynodepool"},"NodeClaim":{"name":"mynodepool-btjcg"},"requests":{"cpu":"210m","memory":"240Mi","nvidia.com/gpu":"1","pods":"9"},"instance-types":"g5.4xlarge"}
nodeclaim.yaml (right after I requested its deletion)
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
annotations:
compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
compatibility.karpenter.k8s.aws/kubelet-drift-hash: "15379597991425564585"
karpenter.k8s.aws/ec2nodeclass-hash: "6440581379273964080"
karpenter.k8s.aws/ec2nodeclass-hash-version: v3
karpenter.k8s.aws/tagged: "true"
karpenter.sh/nodepool-hash: "13389053327402833262"
karpenter.sh/nodepool-hash-version: v3
karpenter.sh/stored-version-migrated: "true"
creationTimestamp: "2024-11-14T10:02:24Z"
deletionGracePeriodSeconds: 0
deletionTimestamp: "2024-11-15T09:25:36Z"
finalizers:
- karpenter.sh/termination
generateName: mynodepool-
generation: 2
labels:
karpenter.k8s.aws/instance-category: g
karpenter.k8s.aws/instance-cpu: "16"
karpenter.k8s.aws/instance-cpu-manufacturer: amd
karpenter.k8s.aws/instance-ebs-bandwidth: "4750"
karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
karpenter.k8s.aws/instance-family: g5
karpenter.k8s.aws/instance-generation: "5"
karpenter.k8s.aws/instance-gpu-count: "1"
karpenter.k8s.aws/instance-gpu-manufacturer: nvidia
karpenter.k8s.aws/instance-gpu-memory: "24576"
karpenter.k8s.aws/instance-gpu-name: a10g
karpenter.k8s.aws/instance-hypervisor: nitro
karpenter.k8s.aws/instance-local-nvme: "600"
karpenter.k8s.aws/instance-memory: "65536"
karpenter.k8s.aws/instance-network-bandwidth: "10000"
karpenter.k8s.aws/instance-size: 4xlarge
karpenter.sh/capacity-type: on-demand
karpenter.sh/nodepool: mynodepool
kubernetes.io/arch: amd64
kubernetes.io/os: linux
node-role: training
node.kubernetes.io/instance-type: g5.4xlarge
nvidia.com/gpu: A10G
topology.k8s.aws/zone-id: euw1-az1
topology.kubernetes.io/region: eu-west-1
topology.kubernetes.io/zone: eu-west-1a
name: mynodepool-btjcg
ownerReferences:
- apiVersion: karpenter.sh/v1
blockOwnerDeletion: true
kind: NodePool
name: mynodepool
uid: 6bd5acad-c9b4-4350-a369-b151eb571089
resourceVersion: "460997815"
uid: 8a2001a9-af0d-4006-8741-1d515f993d70
spec:
expireAfter: Never
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: large-disk
requirements:
- key: karpenter.k8s.aws/instance-memory
operator: Gt
values:
- "60000"
- key: node.kubernetes.io/instance-type
operator: In
values:
- g5.4xlarge
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- key: karpenter.sh/nodepool
operator: In
values:
- mynodepool
- key: nvidia.com/gpu
operator: In
values:
- A10G
- key: karpenter.k8s.aws/instance-family
operator: In
values:
- g5
resources:
requests:
cpu: 210m
memory: 240Mi
nvidia.com/gpu: "1"
pods: "9"
taints:
- effect: NoSchedule
key: nvidia.com/gpu
value: "true"
status:
allocatable:
cpu: 15890m
ephemeral-storage: "538926258176"
memory: 57691Mi
nvidia.com/gpu: "1"
pods: "234"
vpc.amazonaws.com/pod-eni: "34"
capacity:
cpu: "16"
ephemeral-storage: 600G
memory: 60620Mi
nvidia.com/gpu: "1"
pods: "234"
vpc.amazonaws.com/pod-eni: "34"
conditions:
- lastTransitionTime: "2024-11-14T10:12:26Z"
message: ""
reason: ConsistentStateFound
status: "True"
type: ConsistentStateFound
- lastTransitionTime: "2024-11-14T10:03:07Z"
message: Resource "nvidia.com/gpu" was requested but not registered
reason: ResourceNotRegistered
status: Unknown
type: Initialized
- lastTransitionTime: "2024-11-15T09:25:37Z"
message: ""
reason: InstanceTerminating
status: "True"
type: InstanceTerminating
- lastTransitionTime: "2024-11-14T10:02:26Z"
message: ""
reason: Launched
status: "True"
type: Launched
- lastTransitionTime: "2024-11-14T10:02:49Z"
message: Initialized=Unknown
reason: UnhealthyDependents
status: Unknown
type: Ready
- lastTransitionTime: "2024-11-14T10:02:49Z"
message: ""
reason: Registered
status: "True"
type: Registered
imageID: ami-06455fd9d0e2a0590
nodeName: i-0bb15c620a9cf7aba.eu-west-1.compute.internal
providerID: aws:///eu-west-1a/i-0bb15c620a9cf7aba
Thanks @apjneeraj for following up on timeslicing being an issue. For this particular issue, we are trying to track down issues with the GPU nodes not becoming ready when not using timeslicing. If you find more issues related to timeslicing, please cut us a new issue so we can track it separately!
Thanks @awoimbee, we thought we had fixed this in 1.26.2 but your report helps us dive into what we need to do next. It appears that our original fix did not resolve the root cause. I'm taking a closer look and will report back with what I find.
We got the same issue: some of our GPU nodes (g4dn.xlarge) are failing to initialize the NVIDIA driver on Kubernetes 1.29. The affected image is bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.27.0-0d5d3ecf.
(same as first post)
When the driver fails to initialize, the node does not advertise its nvidia.com/gpu capacity, causing the corresponding pods to remain in a Pending state indefinitely. This prevents any pods scheduled on the affected node from starting.
We are using Karpenter to provision our nodes and got the same error.
Resource "nvidia.com/gpu" was requested but not registered
After digging in, I found that I mistakenly failed to tag my fix into the 1.26.2 release branch, so it was not released until Bottlerocket 1.27.0.
Apologies for the mixup. For anyone still experiencing this issue, there's a good chance that it is resolved by Bottlerocket 1.27.0 and not 1.26.2 as previously mentioned.
@cogentist-yann's report using 1.27.0 leads me to believe there may additionally be a Karpenter component here that confounds things. I'm attempting to reproduce the issue using Bottlerocket 1.27.0 on Karpenter.
I have attempted to replicate this behavior on 1.27.0 over hundreds of instance launches (on g4dn.xlarge and g5.xlarge) but have thus far been unsuccessful. Here's the Karpenter setup that I used to attempt to reproduce:
https://gist.github.com/cbgbt/b6bda1acd5142a81a0ca6c1c6c23a0b9
For what it's worth, this setup does capture the failures on 1.26.2, so if there's still an issue present, I believe it has some other factor that isn't captured here.
@cogentist-yann I'm wondering if there's some facet that I'm missing in the reproducer. If you happen to hit the issue, do you mind checking the systemd journal for errors that look like the ones @bcressey called out in this comment? Are you enabling any additional Bottlerocket settings or deploying any additional Kubernetes services related to GPU configuration and scheduling?
@cbgbt thanks,
The AMI seems different; it was ami-0b16af158543f0cd1.
Here is the provisioner used for Karpenter:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: gpu-worker
spec:
consolidation:
enabled: true
ttlSecondsUntilExpired: 86400
labels:
nvidia.com/accelerator: gpu
project: cluster-name
tegaki.ai/nodepool: gpu-worker
provider:
amiFamily: Bottlerocket
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
deleteOnTermination: true
volumeSize: 4Gi
volumeType: gp3
- deviceName: /dev/xvdb
ebs:
deleteOnTermination: true
volumeSize: 50Gi
volumeType: gp3
instanceProfile: KarpenterInstanceProfile-cluster-name
subnetSelector:
karpenter.sh/discovery/cluster-name: "karpenter"
securityGroupSelector:
karpenter.sh/discovery: "cluster-name"
tags:
project: cluster-name
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values:
- g4dn.xlarge
- g4dn.2xlarge
- key: topology.kubernetes.io/zone
operator: In
values:
- ap-northeast-1a
- ap-northeast-1c
- ap-northeast-1d
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.sh/capacity-type
operator: In
values:
- spot
- on-demand
taints:
- effect: NoSchedule
key: tegaki.ai/nodepool
value: gpu-worker
I will update this message Monday.
UPDATE: @cbgbt
I am not able to get the logs, as the problem is not present in today's scale-up.
Today's AMI is ami-0fcf3021eaa69cf2d, bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.27.1-efd46c32.
Thanks @cogentist-yann. I'll run my tests with the specific AMI you've mentioned as well.
Otherwise, I believe this issue is resolved, but I'll leave it open for a while in case there is new data. The reported failure on 1.27.0 is suspicious.
I've run some tests on my end as well with bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.27.1-efd46c32, and I wasn't able to reproduce the issue 🥳.
Thanks a lot for solving it!