
[Bug] Misleading error message in RayService when upgrading to KubeRay v1.1.0

Open kevin85421 opened this issue 1 year ago • 3 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

 "error": "strconv.Atoi: parsing \"\": invalid syntax",
    "level": "error",
    "logger": "controllers.RayService",
    "msg": "Failed to serialize new RayCluster config. Manual config updates will NOT be tracked accurately. Please manually tear down the cluster and apply a new config.",
    "stacktrace": "github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).shouldPrepareNewRayCluster\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:556\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).reconcileRayCluster\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:397\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:126\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227",
    "ts": "2024-03-30T00:53:31.584Z"
  }

Reproduction script

The user uses ArgoCD to upgrade (1) the KubeRay operator and (2) the CRD while a RayService is running. The KubeRay operator then prints the error message above, but the RayService still functions well (e.g., in-place updates still work). We need to figure out why this misleading message is printed.

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

kevin85421 avatar Apr 18 '24 01:04 kevin85421

The error message may not be misleading. I've run into the same issue after updating the KubeRay operator to v1.1.0: no further updates to the Ray cluster config were applied (e.g., updating the Ray version afterwards). I had to delete and recreate the whole RayService so that a new cluster would be created, and only after that were further updates reconciled again as expected (and this error message did not appear anymore).

tmyhu avatar Apr 25 '24 23:04 tmyhu

I am also using ArgoCD. I upgraded the KubeRay operator from 1.0.0 to 1.1.1 and then updated my RayService from Ray 2.10.0 to 2.22.0, but the RayService is not upgrading. The existing 2.10.0 cluster is still there, and no new pods are created for a 2.22.0 cluster. I think this is the same issue @tmyhu commented on above.

For some customers, deleting and re-creating the RayService is not possible because we are running production applications that cannot have downtime.

sfrolich avatar May 24 '24 21:05 sfrolich

@kevin85421 should I open a new bug for the RayService not upgrading, or is this one good enough (even though the title doesn't match my problem)?

sfrolich avatar May 24 '24 21:05 sfrolich

@kevin85421 this came up in a production upgrade today. I'm going to have to disrupt traffic to RayServe and create a new cluster by deleting and re-creating the CR. Can this be slotted for the next release? In its current state it isn't production-ready. Just to be clear, I'm not complaining about the message; the message is correct. What isn't correct is that the cluster will no longer take updates to the CR.

sfrolich avatar Aug 21 '24 19:08 sfrolich

Is this problem specific to using Argo CD?

andrewsykim avatar Aug 21 '24 19:08 andrewsykim

Requiring a recreate of the RayService is pretty bad. Is there a patch fix we can do for v1.1.2 to avoid needing that?

andrewsykim avatar Aug 21 '24 19:08 andrewsykim

Is this problem specific to using Argo CD?

I doubt it but that is what I'm using. It just does automatically what a customer would do manually.

sfrolich avatar Aug 21 '24 19:08 sfrolich

I had a meeting with @sfrolich this afternoon to reproduce the issue. The root cause of the issue is:

KubeRay v1.1.0 introduces a new annotation, ray.io/num-worker-groups. If a RayCluster was created by a KubeRay operator older than v1.1.0, it doesn't have this annotation, so the annotation lookup returns an empty string and strconv.Atoi fails.

https://github.com/ray-project/kuberay/blob/a69252e6cc90995a3b3085f14af5bcc6869a2d93/ray-operator/controllers/ray/rayservice_controller.go#L560
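
For illustration, here is a minimal, self-contained Go sketch of the failure mode at the linked line (this is not the actual controller code; the map literal simply stands in for the RayCluster's metadata annotations):

package main

import (
    "fmt"
    "strconv"
)

func main() {
    // A RayCluster created by an operator older than v1.1.0 has no
    // ray.io/num-worker-groups annotation, so the lookup returns "".
    annotations := map[string]string{}

    // strconv.Atoi("") fails with `strconv.Atoi: parsing "": invalid syntax`,
    // which is exactly the error in the log above.
    numWorkerGroups, err := strconv.Atoi(annotations["ray.io/num-worker-groups"])
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(numWorkerGroups)
}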

There are two ways to work around this issue:

  1. Upgrade to KubeRay v1.2.0 instead of KubeRay v1.1.1. With https://github.com/ray-project/kuberay/pull/2320, if a RayCluster was not created by the same version as the current KubeRay operator, the KubeRay operator first updates the CR, which adds the annotation ray.io/num-worker-groups to it.

  2. Annotate the RayCluster after you upgrade to v1.1.1. For example, the following command adds the annotation ray.io/num-worker-groups to the RayCluster CR. Note that this command assumes the RayCluster CR has only 1 worker group; one way to check the actual count is shown after the command.

kubectl annotate rayclusters.ray.io rayservice-sample-raycluster-wzxd7 ray.io/num-worker-groups=1
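
If you are unsure how many worker groups the RayCluster has, one way to check (using the same example cluster name as above) is to list the worker group names and count them, then use that count as the annotation value:

kubectl get rayclusters.ray.io rayservice-sample-raycluster-wzxd7 -o jsonpath='{.spec.workerGroupSpecs[*].groupName}'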

The action item here is to update the upgrade guide.

kevin85421 avatar Aug 29 '24 00:08 kevin85421

@kevin85421 I did the following: KubeRay 1.0.0 --> add RayService with Ray 2.9.3 --> KubeRay 1.1.1 --> KubeRay 1.2.0, and the error went away. The bad news: I then made an update to the existing RayService (upped the CPU for the worker nodes). The new RayCluster came up and the old one went away, BUT the Ray head and serve services are still pointing to the old cluster :-( I see this new error in the KubeRay log:

{"level":"error","ts":"2024-08-29T22:18:48.057Z","logger":"controllers.RayService","msg":"Reconciler error","RayService":{"name":"ray-service","namespace":"ray-service"},"reconcileID":"bbd52b57-ce4f-479b-a418-33c0fb78592f","error":"Service \"ray-service-head-svc\" is invalid: spec.clusterIPs[0]: Invalid value: []string{\"None\"}: may not change once set","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

sfrolich avatar Aug 29 '24 22:08 sfrolich

The root cause is that v1.2.0 makes the head service headless by default. However, the K8s API server will throw an exception if we attempt to change a running ClusterIP service to a headless service. Thus, the reconciliation loop will return early, and the K8s service for serving will not be reconciled.

The workaround here is to set ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE to true in values.yaml to use a normal ClusterIP service instead:

- name: ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE
  value: "true"

kevin85421 avatar Aug 30 '24 00:08 kevin85421

Opened a PR: https://github.com/ray-project/kuberay/pull/2343

kevin85421 avatar Aug 30 '24 00:08 kevin85421

Will this be a 1.2.0 patch or in the next release?

sfrolich avatar Aug 30 '24 01:08 sfrolich

@sfrolich I will release Helm charts for 1.2.1 to disable the headless service by default, but I will not build images for KubeRay v1.2.1.

kevin85421 avatar Aug 31 '24 00:08 kevin85421

Update: I decided to release both images and Helm charts.

kevin85421 avatar Aug 31 '24 06:08 kevin85421

@kevin85421 I tested 1.2.1 and it was able to switch the services over to the new cluster. Looks good now.

sfrolich avatar Sep 04 '24 17:09 sfrolich