
Pod Stuck in Pending State After GPU Node Restart, Requires Manual Deletion


What happened:

After the GPU node (x99) was manually powered off and then restarted, a Pod requesting a fraction of the GPU's cores and memory remained stuck in the Pending state. The Pod had been running normally on that node before the shutdown. Only after manually deleting and recreating the Pod did it schedule and run again; HAMi's scheduler did not automatically reschedule it once the node returned to the Ready state.

What you expected to happen:

When the GPU node (x99) recovered and returned to the Ready state, the Pod should have been automatically rescheduled and resumed running on the available node without requiring manual deletion and recreation.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Pod on the GPU Worker node (x99) requesting a fraction of the GPU's cores and memory (e.g., using HAMi's resource annotation hami.io/gpu-core: "30"; a manifest sketch follows this list)
  2. Manually shut down the Worker node (x99), causing it to go offline in Kubernetes
  3. Observe the Pod entering the Pending state
  4. Restart the Worker node and wait for it to return to Ready
  5. Notice that the Pod remains in the Pending state and is not automatically rescheduled
  6. Manually delete and recreate the Pod; the new Pod is then scheduled and runs normally
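
For concreteness, a minimal sketch of the kind of Pod manifest used in step 1. The file and Pod names are placeholders, and the resource keys below assume HAMi's commonly documented extended resources (nvidia.com/gpu, nvidia.com/gpucores, nvidia.com/gpumem); if this cluster uses the annotation form from step 1 instead, substitute accordingly:

cat > gpu-frac-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-frac-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA-capable image works
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1         # one vGPU slice
        nvidia.com/gpucores: 30   # ~30% of the card's compute (matches the report)
        nvidia.com/gpumem: 1024   # 1024 MiB of device memory (matches the report)
EOF
kubectl apply -f gpu-frac-test.yaml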

Anything else we need to know?:

Current system pod status (both HAMi components are running normally on node x99):

wzj@X99:~/Desktop$ kubectl get pod -n kube-system -o wide
NAME                                                       READY   STATUS    RESTARTS        AGE     IP               NODE                               NOMINATED NODE   READINESS GATES
coredns-7cc97dffdd-vhl2n                                   1/1     Running   0               2d19h   10.244.0.3       wzj-standard-pc-i440fx-piix-1996   <none>           <none>
coredns-7cc97dffdd-z9gjx                                   1/1     Running   0               2d19h   10.244.0.2       wzj-standard-pc-i440fx-piix-1996   <none>           <none>
etcd-wzj-standard-pc-i440fx-piix-1996                      1/1     Running   3 (2d19h ago)   2d19h   192.168.31.209   wzj-standard-pc-i440fx-piix-1996   <none>           <none>
hami-device-plugin-r6jwf                                   2/2     Running   4 (13h ago)     2d10h   192.168.31.88    x99                                <none>           <none>
hami-scheduler-5b65bf77c-m7bq7                             2/2     Running   0               21h     10.244.1.40      x99                                <none>           <none>
kube-apiserver-wzj-standard-pc-i440fx-piix-1996            1/1     Running   3 (2d19h ago)   2d19h   192.168.31.209   wzj-standard-pc-i440fx-piix-1996   <none>           <none>
kube-controller-manager-wzj-standard-pc-i440fx-piix-1996   1/1     Running   2 (2d19h ago)   2d19h   192.168.31.209   wzj-standard-pc-i440fx-piix-1996   <none>           <none>
kube-proxy-2jrk6                                           1/1     Running   4 (13h ago)     2d19h   192.168.31.88    x99                                <none>           <none>
kube-proxy-z2rvs                                           1/1     Running   1 (2d19h ago)   2d19h   192.168.31.209   wzj-standard-pc-i440fx-piix-1996   <none>           <none>
kube-scheduler-wzj-standard-pc-i440fx-piix-1996            1/1     Running   3 (2d19h ago)   2d19h   192.168.31.209   wzj-standard-pc-i440fx-piix-1996   <none>           <none>

Cluster Information:

  • Cluster topology: 1 Master node (wzj-standard-pc-i440fx-piix-1996) + 1 GPU Worker node (x99)
  • Issue occurs with Pods requesting partial GPU resources (GPU cores and memory)
  • After the node recovers, the HAMi scheduler appears to retain stale scheduling information (see the inspection commands below)
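
To check whether stale scheduling state is actually involved, it would help to capture the pending Pod's scheduling events and annotations before deleting it. The pod name is a placeholder; these are plain kubectl commands with no HAMi-specific assumptions:

# Why the scheduler keeps rejecting the Pod (FailedScheduling events, extender messages)
kubectl describe pod <pending-pod-name>

# Full Pod object -- look at metadata.annotations for any binding or device-allocation
# keys left over from the scheduling cycle before the shutdown
kubectl get pod <pending-pod-name> -o yaml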

Debug Information (to be collected from GPU node x99):

# 1. NVIDIA system information
nvidia-smi -a

# 2. Container runtime configuration
# For Docker:
cat /etc/docker/daemon.json
# For Containerd:
cat /etc/containerd/config.toml

# 3. HAMi device plugin logs (the pod runs two containers, so capture both)
kubectl logs -n kube-system hami-device-plugin-r6jwf --all-containers=true

# 4. HAMi scheduler logs (likewise a two-container pod)
kubectl logs -n kube-system hami-scheduler-5b65bf77c-m7bq7 --all-containers=true

# 5. Kubelet logs on the GPU node
sudo journalctl -r -u kubelet

# 6. Kernel messages related to GPU/NVIDIA
dmesg | grep -i nvidia
dmesg | grep -i gpu
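
# Two more data points that could help (plain kubectl; the grep pattern and the
# assumption that HAMi records device registration in node annotations are mine):

# 7. Node object for x99 -- check whether its annotations were refreshed after the
#    restart (HAMi's device registration, if it is stored there)
kubectl get node x99 -o yaml

# 8. Cluster events around the restart window, filtered for scheduling/GPU/x99
kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'sched|gpu|x99'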

Environment:

  • HAMi version: 2.7.1
  • Kubernetes version: 1.34.3
  • nvidia driver version: (to be filled)
  • Container runtime and version: (to be filled)
  • OS: (to be filled)
  • Kernel version: (from uname -a)
  • GPU model: (from nvidia-smi)
  • Pod resource request example: hami.io/gpu-core: "30", hami.io/gpu-memory: "1024"

Additional Context: This issue seems related to how HAMi handles partial GPU resource binding during node failure scenarios. When the node recovers, the partially allocated GPU resources may not be properly released or re-registered in the scheduler's state. The problem appears specific to partial GPU allocation, as full GPU allocation may work differently.
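
One quick experiment that might narrow this down (a suggestion, not something already tried here): while the original Pod is still Pending, submit an identical copy under a different name. If the copy schedules onto x99, the node's capacity accounting recovered and the problem sits in the stuck Pod's own scheduling/binding state; if the copy also stays Pending, the scheduler's view of the node's partial GPU capacity was not restored. Names reuse the hypothetical gpu-frac-test manifest from the reproduction sketch above:

# Create a copy of the Pod under a different name and watch whether it schedules
sed 's/name: gpu-frac-test/name: gpu-frac-test-copy/' gpu-frac-test.yaml | kubectl apply -f -
kubectl get pod gpu-frac-test gpu-frac-test-copy -o wide -w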


Notes for maintainers:

  1. The issue is reproducible with partial GPU resource requests
  2. Both HAMi components (scheduler and device-plugin) are running normally
  3. The problem occurs specifically during node failure/recovery cycles
  4. Manual Pod deletion forces a fresh scheduling decision, which succeeds (workaround sketch below)
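
For anyone hitting the same thing before a fix lands, the interim workaround from point 4 as a sketch (assuming the original manifest is still at hand as gpu-frac-test.yaml, the hypothetical file from the reproduction sketch above):

# Delete the stuck Pod and recreate it from the original manifest; the fresh Pod
# goes through scheduling again and lands on x99
kubectl delete pod gpu-frac-test && kubectl apply -f gpu-frac-test.yaml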

Screenshots/Logs: Will be provided upon request after collecting the debug information from the GPU node.

wenzhaojie · Dec 24, 2025