[Bug] Reconciler error after changing the value of nameOverride in values.yaml of the Ray Cluster Helm installation

Open nhha1602 opened this issue 1 year ago • 2 comments
Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I tried to install a Ray cluster using Helm. In the values.yaml of the ray-cluster chart, I changed the value of nameOverride from "kuberay" to "mykuberay". The kuberay-operator pod then logged the following:

2023-12-01T09:18:50.777Z        INFO    controllers.RayCluster  Read request instance not found error!  {"name": "kuberay-dev/raycluster-kuberay"}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconciling RayCluster  {"cluster name": "raycluster-mykuberay"}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  Reconciling Ingress
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconcileHeadService    {"1 head service found": "raycluster-mykuberay-head-svc"}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconcilePods   {"Found 1 head Pod": "raycluster-mykuberay-head-qd9b8", "Pod status": "Running", "Pod restart policy": "Always", "Ray container terminated status": "nil"}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconcilePods   {"head Pod": "raycluster-mykuberay-head-qd9b8", "shouldDelete": false, "reason": "KubeRay does not need to delete the head Pod raycluster-mykuberay-head-qd9b8. The Pod status is Running, and the Ray container terminated status is nil."}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconcilePods   {"desired workerReplicas (always adhering to minReplicas/maxReplica)": 0, "worker group": "cpuGroup", "maxReplicas": 3, "minReplicas": 0, "replicas": 0}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconcilePods   {"removing the pods in the scaleStrategy of": "cpuGroup"}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconcilePods   {"workerReplicas": 0, "runningPods": 0, "diff": 0}
2023-12-01T09:19:39.202Z        INFO    controllers.RayCluster  reconcilePods   {"all workers already exist for group": "cpuGroup"}
2023-12-01T09:19:39.203Z        INFO    controllers.RayCluster  Got error when calculating new status   {"cluster name": "raycluster-mykuberay", "error": "unable to find head service. cluster name raycluster-mykuberay, filter labels map[app.kubernetes.io/created-by:kuberay-operator app.kubernetes.io/name:kuberay ray.io/cluster:raycluster-mykuberay ray.io/identifier:raycluster-mykuberay-head ray.io/node-type:head]"}
2023-12-01T09:19:39.203Z        ERROR   controller.raycluster-controller        Reconciler error        {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "name": "raycluster-mykuberay", "namespace": "kuberay-dev", "error": "unable to find head service. cluster name raycluster-mykuberay, filter labels map[app.kubernetes.io/created-by:kuberay-operator app.kubernetes.io/name:kuberay ray.io/cluster:raycluster-mykuberay ray.io/identifier:raycluster-mykuberay-head ray.io/node-type:head]"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
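
One reading of the log above: reconcileHeadService finds raycluster-mykuberay-head-svc, but the status calculation filters on app.kubernetes.io/name:kuberay and finds nothing. A Kubernetes equality-based label selector only matches when every key/value pair is present on the object, so a single differing label is enough to hide the service. The sketch below illustrates that rule with the filter labels copied from the error message. The labels on the service are an assumption (that nameOverride ends up in app.kubernetes.io/name); the issue does not show them, and `matches` is a hypothetical helper, not KubeRay code.

```python
def matches(selector, labels):
    """Equality-based selector: every selector pair must appear in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Filter labels copied from the "unable to find head service" error message.
selector = {
    "app.kubernetes.io/created-by": "kuberay-operator",
    "app.kubernetes.io/name": "kuberay",
    "ray.io/cluster": "raycluster-mykuberay",
    "ray.io/identifier": "raycluster-mykuberay-head",
    "ray.io/node-type": "head",
}

# ASSUMPTION: the head service carries the overridden name in this label.
# The issue does not include the service's actual labels.
service_labels = {**selector, "app.kubernetes.io/name": "mykuberay"}

# Collect every selector key whose value differs on the service.
mismatched = {
    key: (wanted, service_labels.get(key))
    for key, wanted in selector.items()
    if service_labels.get(key) != wanted
}

print("selector matches service:", matches(selector, service_labels))
print("mismatched labels:", mismatched)
```

If that assumption holds, comparing the output of `kubectl get svc raycluster-mykuberay-head-svc -n kuberay-dev --show-labels` against the filter labels in the error should show the same single mismatch.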

Reproduction script

The default value of nameOverride can be seen at the following link: https://github.com/ray-project/kuberay-helm/blob/07463a11e78d934f850f0b4ab20cf3b17803b86c/helm-chart/ray-cluster/values.yaml#L13

I changed this value to another one and then got the error log above in kuberay-operator. Please advise.

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

nhha1602 avatar Dec 01 '23 09:12 nhha1602

Hi @kevin85421 - Is this currently being worked on? We are running into this error as well, since we also deploy using Helm and change the nameOverride value. If it is not being worked on, I'm willing to look into it and submit a PR. Thanks!

chrisxstyles avatar Mar 05 '24 16:03 chrisxstyles

@chrisxstyles, thank you! Your PR is welcome!

kevin85421 avatar Mar 05 '24 18:03 kevin85421