
[Bug] Misleading error message in RayService when upgrading to KubeRay v1.1.0

Open kevin85421 opened this issue 1 year ago • 3 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

 "error": "strconv.Atoi: parsing \"\": invalid syntax",
    "level": "error",
    "logger": "controllers.RayService",
    "msg": "Failed to serialize new RayCluster config. Manual config updates will NOT be tracked accurately. Please manually tear down the cluster and apply a new config.",
    "stacktrace": "github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).shouldPrepareNewRayCluster\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:556\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).reconcileRayCluster\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:397\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:126\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227",
    "ts": "2024-03-30T00:53:31.584Z"
  }

Reproduction script

The user uses ArgoCD to upgrade (1) the KubeRay operator and (2) the CRD while a RayService is running. The KubeRay operator then prints the error message above, but the RayService still functions well (e.g., in-place updates still work). We need to figure out why this misleading message is printed.

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

kevin85421 avatar Apr 18 '24 01:04 kevin85421

The error message may not be misleading. I've run into the same issue after updating the KubeRay operator to v1.1.0: no further updates to the Ray cluster config were applied (e.g., updating the Ray version afterwards). I had to delete and recreate the whole RayService so that a new cluster would be created, and only after that were further updates reconciled again as expected (and this error message did not appear anymore).

tmyhu avatar Apr 25 '24 23:04 tmyhu

I am also using ArgoCD. I upgraded the KubeRay operator from 1.0.0 to 1.1.1 and then updated my RayService from Ray 2.10.0 to 2.22.0, but the RayService is not upgrading. The existing 2.10.0 cluster is still there, and no new pods are created for a 2.22.0 cluster. I think this is the same issue @tmyhu commented on above.

For some customers, deleting and re-creating the RayService is not possible because we are running production applications that cannot have downtime.

sfrolich avatar May 24 '24 21:05 sfrolich

@kevin85421 should I open a new bug for the RayService not upgrading, or is this one good enough (even though the title doesn't match my problem)?

sfrolich avatar May 24 '24 21:05 sfrolich

@kevin85421 this came up in a production upgrade today. I'm going to have to disrupt traffic to RayServe and create a new cluster by deleting and re-creating the CR. Can this be slotted for the next release? In its current state it isn't production-ready. Just to be clear, I'm not complaining about the message; the message is correct. What isn't correct is that the cluster will no longer take updates to the CR.

sfrolich avatar Aug 21 '24 19:08 sfrolich

Is this problem specific to using Argo CD?

andrewsykim avatar Aug 21 '24 19:08 andrewsykim

Requiring a recreate of the RayService is pretty bad. Is there a patch fix we can do for v1.1.2 to avoid needing that?

andrewsykim avatar Aug 21 '24 19:08 andrewsykim

Is this problem specific to using Argo CD?

I doubt it but that is what I'm using. It just does automatically what a customer would do manually.

sfrolich avatar Aug 21 '24 19:08 sfrolich

I had a meeting with @sfrolich this afternoon to reproduce the issue. The root cause of the issue is:

KubeRay v1.1.0 introduces a new annotation, ray.io/num-worker-groups. If a RayCluster was created by a KubeRay operator older than v1.1.0, it doesn't have this annotation, so the annotation lookup returns an empty string and strconv.Atoi fails.

https://github.com/ray-project/kuberay/blob/a69252e6cc90995a3b3085f14af5bcc6869a2d93/ray-operator/controllers/ray/rayservice_controller.go#L560
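
For illustration, here is a minimal, self-contained Go sketch of the failure mode at the linked line (this is not the actual controller code; the map literal simply stands in for the RayCluster's metadata annotations):

package main

import (
    "fmt"
    "strconv"
)

func main() {
    // A RayCluster created by an operator older than v1.1.0 has no
    // ray.io/num-worker-groups annotation, so the lookup returns "".
    annotations := map[string]string{}

    // strconv.Atoi("") fails with `strconv.Atoi: parsing "": invalid syntax`,
    // which is exactly the error in the log above.
    numWorkerGroups, err := strconv.Atoi(annotations["ray.io/num-worker-groups"])
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(numWorkerGroups)
}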

There are two ways to work around this issue:

  1. Upgrade to KubeRay v1.2.0 instead of KubeRay v1.1.1. With https://github.com/ray-project/kuberay/pull/2320, if a RayCluster was not created by the same version as the current KubeRay operator, the KubeRay operator first updates the CR, which adds the annotation ray.io/num-worker-groups to it.

  2. Annotate the RayCluster after you upgrade to v1.1.1. For example, the following command adds the annotation ray.io/num-worker-groups to the RayCluster CR. Note that this command assumes the RayCluster CR has only 1 worker group; one way to check the actual count is shown after the command.

kubectl annotate rayclusters.ray.io rayservice-sample-raycluster-wzxd7 ray.io/num-worker-groups=1
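
If you are unsure how many worker groups the RayCluster has, one way to check (using the same example cluster name as above) is to list the worker group names and count them, then use that count as the annotation value:

kubectl get rayclusters.ray.io rayservice-sample-raycluster-wzxd7 -o jsonpath='{.spec.workerGroupSpecs[*].groupName}'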

The action item here is to update the upgrade guide.

kevin85421 avatar Aug 29 '24 00:08 kevin85421

@kevin85421 I did the following: KubeRay 1.0.0 --> add RayService with Ray 2.9.3 --> KubeRay 1.1.1 --> KubeRay 1.2.0, and the error went away. The bad news: I then made an update to the existing RayService (upped the CPU for the worker nodes). The new RayCluster came up and the old one went away, BUT the Ray head and serve services are still pointing to the old cluster :-( I see this new error in the KubeRay log:

{"level":"error","ts":"2024-08-29T22:18:48.057Z","logger":"controllers.RayService","msg":"Reconciler error","RayService":{"name":"ray-service","namespace":"ray-service"},"reconcileID":"bbd52b57-ce4f-479b-a418-33c0fb78592f","error":"Service \"ray-service-head-svc\" is invalid: spec.clusterIPs[0]: Invalid value: []string{\"None\"}: may not change once set","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

sfrolich avatar Aug 29 '24 22:08 sfrolich

The root cause is that v1.2.0 makes the head service headless by default. However, the K8s API server will throw an exception if we attempt to change a running ClusterIP service to a headless service. Thus, the reconciliation loop will return early, and the K8s service for serving will not be reconciled.

The workaround here is to set ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE to true in values.yaml to use a normal ClusterIP service instead:

- name: ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE
  value: "true"

kevin85421 avatar Aug 30 '24 00:08 kevin85421

Opened a PR: https://github.com/ray-project/kuberay/pull/2343

kevin85421 avatar Aug 30 '24 00:08 kevin85421

Will this be a 1.2.0 patch or in the next release?

sfrolich avatar Aug 30 '24 01:08 sfrolich

@sfrolich I will release Helm charts for 1.2.1 to disable the headless service by default, but I will not build images for KubeRay v1.2.1.

kevin85421 avatar Aug 31 '24 00:08 kevin85421

Update: I decided to release both images and Helm charts.

kevin85421 avatar Aug 31 '24 06:08 kevin85421

@kevin85421 I tested 1.2.1 and it was able to switch the services over to the new cluster. Looks good now.

sfrolich avatar Sep 04 '24 17:09 sfrolich