[Bug] Bubble ImagePullErr and ImagePullBackoff to the Ray CRD
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I deployed a RayJob with a bad image reference (the image does not exist). The RayJob stayed in the "Initializing" phase and did not get updated or bubble up the error from starting the driver Pod.
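For illustration, a minimal manifest along these lines reproduces the behavior; the name, image tag, and entrypoint below are placeholders rather than the exact spec used:

```yaml
# Sketch only: name, image tag, and entrypoint are placeholders.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: bad-image-rayjob
spec:
  entrypoint: python -c "print('hello')"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:does-not-exist   # nonexistent tag, Pod goes into ImagePullBackOff
```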
Reproduction script
TBD
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
We are working on improving RayCluster observability with new conditions APIs, which should hopefully surface these types of failures.
@rueian do you know if the existing implementation would surface ImagePullErr and ImagePullBackoff errors?
See https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit?usp=sharing for more details.
Happy to see that catching ImagePullBackOff for the Ray CRD is covered in this doc. Are there any updates on the implementation?
@fiedlerNr9 can you try with KubeRay v1.2? You need to enable the feature gate for the new conditions API. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/observability.html#raycluster-status-conditions
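For reference, the gate is enabled on the operator itself. A sketch assuming a Helm-based install of KubeRay v1.2 (the exact values keys may differ between chart versions; the linked guide is authoritative):

```shell
# Sketch: enable the RayClusterStatusConditions feature gate on the kuberay-operator
# Helm release (values keys assumed; check the chart version you use).
helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.0 \
  --set 'featureGates[0].name=RayClusterStatusConditions' \
  --set 'featureGates[0].enabled=true'
```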
I followed these docs but still see the same behaviour.
```
k get pods -n jan-playground-development | grep a27tckdrxzgx2s4kc4gb
a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k-head-phwb8             0/1   ImagePullBackOff        0   2m35s
a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k-ray-gro-worker-8gxcl   0/1   Init:ImagePullBackOff   0   2m35s

k get rayjobs -n jan-playground-development
NAME                        JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
a27tckdrxzgx2s4kc4gb-n0-0                Initializing        2024-10-04T17:58:32Z              2m19s
```
Just to be on the same page: I would expect the status of the RayJob to reflect the status of the underlying Pods.
At the moment we only update the RayCluster status; can you check the status there? We should support mirroring the new conditions in the RayJob status, though.
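For example, the conditions can be checked directly on the RayCluster behind the RayJob (the name below is inferred from the Pod names above):

```shell
# Inspect the RayCluster's status conditions directly; the RayJob status does not
# mirror them yet.
kubectl describe raycluster a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k -n jan-playground-development
```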
I'm sorry for not getting back to you sooner.
> We are working on improving RayCluster observability with new conditions APIs, which should hopefully surface these types of failures.
> @rueian do you know if the existing implementation would surface ImagePullErr and ImagePullBackoff errors?
> See https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit?usp=sharing for more details.
The existing StatusCondition implementation only reflects errors encountered when calling the Kube API. We also haven't changed the old Status behavior even when the RayClusterStatusConditions feature gate is enabled. Therefore, errors like ImagePullErr and ImagePullBackoff are not reflected or bubbled up yet.
Right now, we have the following status conditions for the case of ImagePullErr if the feature gate is enabled:
```
Status:
  Conditions:
    Last Transition Time:  2024-10-04T19:33:59Z
    Message:               containers with unready status: [ray-head]
    Reason:                ContainersNotReady
    Status:                False
    Type:                  HeadPodReady
    Last Transition Time:  2024-10-04T19:33:59Z
    Message:               RayCluster Pods are being provisioned for first time
    Reason:                RayClusterPodsProvisioning
    Status:                False
    Type:                  RayClusterProvisioned
```
I think we could improve this by carrying the ImagePullErr/ImagePullBackoff messages into the HeadPodReady and RayClusterProvisioned conditions and then finding a way to bubble this into RayJob.
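A hypothetical sketch of what that could look like on the RayCluster (wording, reason, and image name are placeholders, not an agreed design):

```yaml
# Hypothetical sketch, not an agreed design: the Pod's waiting reason/message is
# copied into the HeadPodReady condition so it is visible at the CR level.
status:
  conditions:
    - type: HeadPodReady
      status: "False"
      reason: ContainersNotReady
      message: 'container ray-head is waiting: ImagePullBackOff - Back-off pulling image "rayproject/ray:does-not-exist"'
```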
The RayCluster CRD can be considered a set of Kubernetes ReplicaSets (with each head or worker group similar to a ReplicaSet). Therefore, we aimed to make the observability consistent with ReplicaSets. However, Kubernetes ReplicaSets do not provide information on ImagePullBackOff errors.
For example, I created a ReplicaSet with the image nginx:1.210, which doesn't exist.
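For reference, a sketch of such a ReplicaSet; the pull failure is visible only on the Pods (in their statuses and events), while the ReplicaSet's own .status only reports replica counts:

```yaml
# Sketch: a ReplicaSet whose Pods will sit in ImagePullBackOff.
# Its .status reports replicas/readyReplicas but carries no image pull error.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-bad-image
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-bad-image
  template:
    metadata:
      labels:
        app: nginx-bad-image
    spec:
      containers:
        - name: nginx
          image: nginx:1.210   # nonexistent tag
```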
Although this is not supported by Kubernetes ReplicaSets, we have received these requests multiple times. We will take it into consideration. If we decide to support this, we should clearly define which Pod-level errors should be surfaced by the KubeRay CR.