
[Bug] Bubble ImagePullErr and ImagePullBackoff to the Ray CRD

Open · EngHabu opened this issue 1 year ago · 7 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I deployed a RayJob with a bad image reference (the image does not exist). The RayJob stayed in the "Initializing" phase and did not get updated or bubble up the error from starting the Driver Pod.

Reproduction script

TBD
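
A minimal manifest along these lines should reproduce it (the metadata name, entrypoint, and bad image tag are illustrative, not from the original report):

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-bad-image                 # illustrative name
spec:
  entrypoint: python -c "print('hi')"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:this-tag-does-not-exist   # intentionally bad reference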

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

EngHabu · Sep 17 '24 19:09

We are working on improving RayCluster observability with a new conditions API, which should hopefully surface these types of failures.

@rueian do you know if the existing implementation would surface ImagePullErr and ImagePullBackoff errors?

See https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit?usp=sharing for more details.

andrewsykim · Sep 17 '24 19:09

Happy to see catching ImagePullBackOff for the Ray CRD covered in this doc. Are there any updates on the implementation?

fiedlerNr9 · Oct 03 '24 23:10

@fiedlerNr9 can you try with KubeRay v1.2? You need to enable the feature gate for the new conditions API. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/observability.html#raycluster-status-conditions
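
For reference, enabling it via Helm looks roughly like this (a sketch assuming the kuberay-operator chart's featureGates values; the linked guide has the authoritative steps):

# Some shells require escaping the brackets, e.g. featureGates\[0\].name
helm upgrade --install kuberay-operator kuberay/kuberay-operator \
  --set featureGates[0].name=RayClusterStatusConditions \
  --set featureGates[0].enabled=true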

andrewsykim · Oct 04 '24 16:10

I followed these docs but still see the same behaviour.

k get pods -n jan-playground-development | grep a27tckdrxzgx2s4kc4gb
a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k-head-phwb8             0/1     ImagePullBackOff        0          2m35s
a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k-ray-gro-worker-8gxcl   0/1     Init:ImagePullBackOff   0          2m35s
k get rayjobs -n jan-playground-development
NAME                        JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
a27tckdrxzgx2s4kc4gb-n0-0                Initializing        2024-10-04T17:58:32Z              2m19s

Just so we're on the same page: I would expect the status of the RayJob to reflect the status of the underlying Pods.

fiedlerNr9 · Oct 04 '24 18:10

At the moment we only update the RayCluster status; can you check the status there?

We should support mirroring the new conditions in the RayJob status, though.
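
A command along these lines should print the conditions (the cluster name is taken from your pod output above):

kubectl get raycluster a27tckdrxzgx2s4kc4gb-n0-0-raycluster-dcs6k \
  -n jan-playground-development \
  -o jsonpath='{.status.conditions}'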

andrewsykim · Oct 04 '24 19:10

I'm sorry for not getting back to you sooner.

> We are working on improving RayCluster observability with a new conditions API, which should hopefully surface these types of failures.
>
> @rueian do you know if the existing implementation would surface ImagePullErr and ImagePullBackoff errors?
>
> See https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY/edit?usp=sharing for more details.

The existing StatusCondition implementation only reflects errors encountered when calling the Kubernetes API. We also haven't changed the old Status behavior, even when the RayClusterStatusConditions feature gate is enabled. Therefore, errors like ImagePullErr and ImagePullBackoff are not reflected and not bubbled up yet.

Right now, we have the following status conditions for the case of ImagePullErr if the feature gate is enabled:

Status:
  Conditions:
    Last Transition Time:   2024-10-04T19:33:59Z
    Message:                containers with unready status: [ray-head]
    Reason:                 ContainersNotReady
    Status:                 False
    Type:                   HeadPodReady
    Last Transition Time:   2024-10-04T19:33:59Z
    Message:                RayCluster Pods are being provisioned for first time
    Reason:                 RayClusterPodsProvisioning
    Status:                 False
    Type:                   RayClusterProvisioned

I think we could improve this by carrying the ImagePullErr/ImagePullBackoff messages into the HeadPodReady and RayClusterProvisioned conditions and then finding a way to bubble this into RayJob.
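
A rough sketch of what that detection could look like in the operator (the function name and reason set are illustrative, not the actual KubeRay code; note the kubelet reason is spelled "ErrImagePull"):

package util

import corev1 "k8s.io/api/core/v1"

// Illustrative set of kubelet waiting reasons that indicate the image
// cannot be pulled.
var imagePullReasons = map[string]bool{
	"ErrImagePull":     true,
	"ImagePullBackOff": true,
	"InvalidImageName": true,
}

// findImagePullError returns the first image-pull-related waiting message
// found among the Pod's init and regular container statuses, if any.
// The returned message could then replace the generic "containers with
// unready status" text in HeadPodReady and be mirrored into the RayJob status.
func findImagePullError(pod *corev1.Pod) (string, bool) {
	scan := func(statuses []corev1.ContainerStatus) (string, bool) {
		for _, cs := range statuses {
			if w := cs.State.Waiting; w != nil && imagePullReasons[w.Reason] {
				return w.Reason + ": " + w.Message, true
			}
		}
		return "", false
	}
	// Workers above showed Init:ImagePullBackOff, so init containers matter too.
	if msg, ok := scan(pod.Status.InitContainerStatuses); ok {
		return msg, true
	}
	return scan(pod.Status.ContainerStatuses)
}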

rueian · Oct 04 '24 19:10

The RayCluster CRD can be considered a set of Kubernetes ReplicaSets (with each head or worker group similar to a ReplicaSet). Therefore, we aimed to make the observability consistent with ReplicaSets. However, Kubernetes ReplicaSets do not provide information on ImagePullBackOff errors.

For example, I created a ReplicaSet with the image nginx:1.210, which doesn't exist.

(screenshot: kubectl output for the ReplicaSet; its status surfaces no ImagePullBackOff information)
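
For anyone reproducing this, a minimal manifest along these lines triggers the same state (the metadata and label names are illustrative):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-bad-image          # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-bad-image
  template:
    metadata:
      labels:
        app: nginx-bad-image
    spec:
      containers:
      - name: nginx
        image: nginx:1.210       # this tag does not exist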

Although this is not supported by Kubernetes ReplicaSets, we have received these requests multiple times. We will take it into consideration. If we decide to support this, we should clearly define which Pod-level errors should be surfaced by the KubeRay CR.

kevin85421 · Oct 05 '24 23:10