kuberay
[Bug] KubeRay cluster resource status is reporting Ready when there are pods still pending
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
apiserver
What happened + What you expected to happen
When there are pods stuck in Pending because of insufficient resources, the RayCluster state is reported as ready.
status:
  desiredCPU: "22"
  desiredGPU: "4"
  desiredMemory: 24G
  desiredTPU: "0"
  desiredWorkerReplicas: 2
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs: "6379"
    metrics: "8080"
  head:
    serviceIP: 172.30.12.150
  lastUpdateTime: "2024-06-12T13:35:00Z"
  maxWorkerReplicas: 2
  minWorkerReplicas: 2
  observedGeneration: 2
  state: ready
This is the status of the head pod:
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2024-06-12T13:55:11Z'
      reason: Unschedulable
      message: '0/5 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) didn''t match Pod''s node affinity/selector. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..'
  qosClass: Burstable
Reproduction script
- Submit a RayCluster that meets the ClusterQueue quota requirement so that it runs and is not in the Suspended state.
- The worker node(s) have insufficient resources to run the pods.
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
@astefanutti Filed this as per your request.
@tsailiming what's the KubeRay version? In previous versions it is a known issue that the RayCluster status stays ready indefinitely once it has observed all worker pods as running. There's some discussion about it in https://github.com/ray-project/kuberay/pull/1930
This is from one of the head pods, running on OpenShift AI 2.9.1.
$ ray --version
ray, version 2.7.1
@tsailiming I meant the KubeRay version, not the Ray version
It seems the head pod was fine at first but was rescheduled at some point and then got stuck in Pending, while the RayCluster state was not updated:
https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/raycluster_controller.go#L1196
if utils.CheckAllPodsRunning(ctx, runtimePods) {
	newInstance.Status.State = rayv1.Ready
}
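To make the effect of that check concrete, here is a minimal sketch of the behavior described in this thread. It is a simplified model, not the controller's actual code: `PodPhase`, `ClusterState`, `allPodsRunning`, and `reconcileState` are hypothetical names invented for illustration, and treating an empty pod list as "not running" is an assumption of the sketch.

```go
package main

import "fmt"

// PodPhase and ClusterState are simplified stand-ins for the Kubernetes and
// KubeRay types. They are assumptions of this sketch, not the real API.
type PodPhase string
type ClusterState string

const (
	PodRunning PodPhase = "Running"
	PodPending PodPhase = "Pending"

	StateUnset ClusterState = ""
	StateReady ClusterState = "ready"
)

// allPodsRunning plays the role of the CheckAllPodsRunning call quoted above.
// Returning false for an empty list is an assumption of this sketch.
func allPodsRunning(phases []PodPhase) bool {
	if len(phases) == 0 {
		return false
	}
	for _, p := range phases {
		if p != PodRunning {
			return false
		}
	}
	return true
}

// reconcileState mirrors the shape of the quoted snippet: the state is set to
// ready when every pod is running, and there is no branch that moves it away
// from ready, so the previous value is carried over otherwise.
func reconcileState(prev ClusterState, phases []PodPhase) ClusterState {
	if allPodsRunning(phases) {
		return StateReady
	}
	return prev
}

func main() {
	state := StateUnset

	// First reconcile: head and worker are both running.
	state = reconcileState(state, []PodPhase{PodRunning, PodRunning})
	fmt.Println(state) // ready

	// Later reconcile: the head pod was rescheduled and is now Pending.
	state = reconcileState(state, []PodPhase{PodPending, PodRunning})
	fmt.Println(state) // still ready
}
```

In this model the second reconcile sees a Pending pod but has no path that clears the earlier value, which matches the stale `state: ready` shown in the report.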
#2271 introduces a new condition RayClusterReady and we will gradually deprecate .Status.State. The definition is:
- RayClusterReady indicates whether all Ray Pods are ready when the RayCluster is first created.
- After RayClusterReady is set to true for the first time, it only indicates whether the RayCluster's head Pod is ready for requests.
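The two-phase definition above can be sketched as a small state machine. This is only an illustration of the stated semantics under assumed names: `readyTracker` and `observe` are hypothetical and do not come from the KubeRay codebase or from #2271.

```go
package main

import "fmt"

// readyTracker is a hypothetical helper that remembers whether the condition
// has ever been true. It is not a KubeRay type.
type readyTracker struct {
	everReady bool
}

// observe returns the value the RayClusterReady condition would take under
// the definition quoted above:
//   - before the first time it is true: whether all Ray Pods are ready;
//   - afterwards: whether the head Pod is ready for requests.
func (rt *readyTracker) observe(allPodsReady, headReady bool) bool {
	if !rt.everReady {
		if allPodsReady {
			rt.everReady = true
		}
		return allPodsReady
	}
	return headReady
}

func main() {
	rt := &readyTracker{}

	fmt.Println(rt.observe(false, true))  // false: a worker is not ready yet
	fmt.Println(rt.observe(true, true))   // true: all Pods ready for the first time
	fmt.Println(rt.observe(false, true))  // true: a worker is down, head is ready
	fmt.Println(rt.observe(false, false)) // false: the head Pod is not ready
}
```

Under this reading, the scenario in this issue (a head Pod stuck in Pending after the cluster was once ready) would make the condition false, whereas a worker going away would not.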
Close this issue.