[Bug] Misleading error message in RayService when upgrading to KubeRay v1.1.0
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
"error": "strconv.Atoi: parsing \"\": invalid syntax",
"level": "error",
"logger": "controllers.RayService",
"msg": "Failed to serialize new RayCluster config. Manual config updates will NOT be tracked accurately. Please manually tear down the cluster and apply a new config.",
"stacktrace": "github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).shouldPrepareNewRayCluster\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:556\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).reconcileRayCluster\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:397\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:126\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227",
"ts": "2024-03-30T00:53:31.584Z"
}
Reproduction script
The user uses ArgoCD to upgrade (1) the KubeRay operator and (2) the CRD while a RayService is running. The KubeRay operator then prints the error message above, but the RayService still functions well (e.g., in-place updates still work). We need to figure out why the misleading message is printed.
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
The error message may not be misleading. I ran into the same issue after updating the KubeRay operator to v1.1.0: no updates to the Ray cluster config were applied anymore (e.g., updating the Ray version afterwards). I had to delete and recreate the whole RayService so that a new cluster would be created; only after that were further updates reconciled again as expected (and the error message no longer appeared).
I am also using ArgoCD to upgrade the KubeRay operator from 1.0.0 to 1.1.1 and then updating my RayService from Ray 2.10.0 to 2.22.0, but the RayService is not upgrading. The existing 2.10.0 cluster is still there, and no new pods are created for a new 2.22.0 cluster. I think this is the same as what @tmyhu commented above.
For some customers, deleting and re-creating the RayService is not possible because we are running production applications that cannot have downtime.
@kevin85421 should I open a new bug for RayService not upgrading, or is this one good enough (even though the title doesn't match my problem)?
@kevin85421 this came up in a production upgrade today. I'm going to have to disrupt traffic to Ray Serve and create a new cluster by deleting and re-creating the CR. Can this be slotted for the next release? In its current state it isn't production-ready yet. Just to be clear, I'm not complaining about the message; the message is correct. What isn't correct is that the cluster will no longer take updates to the CR.
Is this problem specific to using Argo CD?
Requiring a recreate of the RayService is pretty bad; is there a patch fix we can do for v1.1.2 to avoid needing to do that?
> Is this problem specific to using Argo CD?
I doubt it but that is what I'm using. It just does automatically what a customer would do manually.
I had a meeting with @sfrolich this afternoon to reproduce the issue. The root cause of the issue is:
KubeRay v1.1.0 introduces a new annotation, ray.io/num-worker-groups. If a RayCluster was created by a KubeRay operator older than v1.1.0, it doesn't have this annotation, so strconv.Atoi fails on the empty value.
https://github.com/ray-project/kuberay/blob/a69252e6cc90995a3b3085f14af5bcc6869a2d93/ray-operator/controllers/ray/rayservice_controller.go#L560
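For illustration, here is a minimal Go sketch of the failure mode (not the actual operator code; only the annotation key comes from the discussion above): a RayCluster created by a pre-v1.1.0 operator has no ray.io/num-worker-groups annotation, so the lookup returns an empty string and strconv.Atoi produces exactly the error in the log above.

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Annotations of a RayCluster created by a pre-v1.1.0 operator:
	// the ray.io/num-worker-groups key does not exist yet.
	annotations := map[string]string{}

	// A missing key yields the empty string, and strconv.Atoi("")
	// returns: strconv.Atoi: parsing "": invalid syntax
	raw := annotations["ray.io/num-worker-groups"]
	if _, err := strconv.Atoi(raw); err != nil {
		fmt.Println(err)
	}
}
```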
There are two ways to work around this issue:
- Upgrade to KubeRay v1.2.0 instead of KubeRay v1.1.1. With https://github.com/ray-project/kuberay/pull/2320, if a RayCluster was not created by the same version as the current KubeRay operator, the KubeRay operator first updates the CR, so the annotation ray.io/num-worker-groups is added to the CR.
- Annotate the RayCluster after you upgrade to v1.1.1. For example, the following command adds the annotation ray.io/num-worker-groups to the RayCluster CR. Note that this command assumes that the RayCluster CR has only 1 worker group.
kubectl annotate rayclusters.ray.io rayservice-sample-raycluster-wzxd7 ray.io/num-worker-groups=1
Action item here is to update the upgrade guide.
@kevin85421 I did the following: KubeRay 1.0.0 --> add RayService with Ray 2.9.3 --> KubeRay 1.1.1 --> KubeRay 1.2.0, and the error went away. The bad news is that I made an update to the existing RayService (upped the CPU for the worker nodes); the new RayCluster came up and the old one went away, BUT the Ray head and serve services are still pointing to the old cluster :-( I see this new error in the KubeRay log:
{"level":"error","ts":"2024-08-29T22:18:48.057Z","logger":"controllers.RayService","msg":"Reconciler error","RayService":{"name":"ray-service","namespace":"ray-service"},"reconcileID":"bbd52b57-ce4f-479b-a418-33c0fb78592f","error":"Service \"ray-service-head-svc\" is invalid: spec.clusterIPs[0]: Invalid value: []string{\"None\"}: may not change once set","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
The root cause is that v1.2.0 makes the head service headless by default. However, the K8s API server will throw an exception if we attempt to change a running ClusterIP service to a headless service. Thus, the reconciliation loop will return early, and the K8s service for serving will not be reconciled.
The workaround here is to set ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE to true in values.yaml to use the normal ClusterIP instead.
- name: ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE
  value: "true"
Opened a PR: https://github.com/ray-project/kuberay/pull/2343
Will this be a 1.2.0 patch or in the next release?
@sfrolich I will release Helm charts for 1.2.1 to disable the headless service by default, but I will not build images for KubeRay v1.2.1.
Update: I decided to release both images and Helm charts.
@kevin85421 I tested 1.2.1 and it was able to switch over the services to the new cluster. Looks good now.