[Bug] --forced-cluster-upgrade Causes termination loop for ray head node
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
- When I enable --forced-cluster-upgrade on the Ray operator (v0.3), it continuously terminates the Ray head node (which it then replaces with an apparently identical pod). I expect this flag to terminate my cluster no more than once, when it is first enabled.
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"head pod found": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "workergroup"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "workergroup"}
2022-09-13T13:39:37.525Z INFO controllers.RayCluster updateStatus {"service port's name is empty. Not adding it to RayCluster status.endpoints": {"protocol":"TCP","port":6379,"targetPort":6379}}
2022-09-13T13:39:40.744Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "deployment-dev-eks-kuberay"}
2022-09-13T13:39:40.744Z INFO controllers.RayCluster reconcileServices {"headService service found": "deployment-dev-eks-kuberay-head-svc"}
For some reason my RayCluster has this list of workersToDelete in its scaleStrategy, which seems relevant:
workerGroupSpecs:
- groupName: workergroup
  maxReplicas: 75
  minReplicas: 0
  rayStartParams:
    block: 'true'
    node-ip-address: $MY_POD_IP
    redis-password: LetMeInRay
  replicas: 0
  scaleStrategy:
    workersToDelete:
    - ty-deployment-prod-eks-kuberay-worker-workergroup-5qg4z
    - ty-deployment-prod-eks-kuberay-worker-workergroup-6bhv4
    - ty-deployment-prod-eks-kuberay-worker-workergroup-75tj6
    - ty-deployment-prod-eks-kuberay-worker-workergroup-bwb99
    - ty-deployment-prod-eks-kuberay-worker-workergroup-d6mmb
    - ty-deployment-prod-eks-kuberay-worker-workergroup-d8hjt
    - ty-deployment-prod-eks-kuberay-worker-workergroup-gmvn5
    - ty-deployment-prod-eks-kuberay-worker-workergroup-gv8z8
    - ty-deployment-prod-eks-kuberay-worker-workergroup-hx989
    - ty-deployment-prod-eks-kuberay-worker-workergroup-jh26v
    - ty-deployment-prod-eks-kuberay-worker-workergroup-kb9xv
    - ty-deployment-prod-eks-kuberay-worker-workergroup-lhgpf
    - ty-deployment-prod-eks-kuberay-worker-workergroup-mbb75
    - ty-deployment-prod-eks-kuberay-worker-workergroup-nlvtq
    - ty-deployment-prod-eks-kuberay-worker-workergroup-pp4lk
    - ty-deployment-prod-eks-kuberay-worker-workergroup-q6gkt
    - ty-deployment-prod-eks-kuberay-worker-workergroup-qcj9p
    - ty-deployment-prod-eks-kuberay-worker-workergroup-qmb54
    - ty-deployment-prod-eks-kuberay-worker-workergroup-vrf7c
    - ty-deployment-prod-eks-kuberay-worker-workergroup-xt5bb
    - ty-deployment-prod-eks-kuberay-worker-workergroup-xwl97
    - ty-deployment-prod-eks-kuberay-worker-workergroup-zxcb2
Reproduction script
- Deploy kuberay 0.3
- Create a raycluster
- Add args to the kuberay operator deployment:
  args: [ '--forced-cluster-upgrade' ]
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Could you share your Ray cluster's config? And, just confirming, is it correct that you didn't update your Ray cluster's configuration and this happened anyway?
One theory is that something in your K8s environment is mutating the pod's configuration away from what's configured in the RayCluster CR and the operator is perceiving that as a configuration change that needs to be reconciled.
cc @wilsonwang371
^ After taking a look at the code, I'm almost certain that this is what's happening.
There are quite a few circumstances where the K8s environment may mutate a pod's resource requests or limits -- for example, GKE autopilot may do this to suit its internal bin-packing.
With the current implementation of --forced-cluster-upgrade, this would lead to the sort of churning behavior observed in this issue.
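To make the churn concrete, here is a minimal Go sketch of the failure mode. This is not the actual KubeRay reconciler code; the PodSpec struct, its fields, and the needsRecreate function are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"reflect"
)

// PodSpec is a stand-in for the fields the operator compares; it is not the
// real Kubernetes PodSpec type.
type PodSpec struct {
	Image    string
	CPULimit string
	NodeName string
}

// needsRecreate models a naive "compare live pod against CR template" check.
func needsRecreate(livePod, crTemplate PodSpec) bool {
	return !reflect.DeepEqual(livePod, crTemplate)
}

func main() {
	crTemplate := PodSpec{Image: "rayproject/ray:2.0.0", CPULimit: "2"}

	// A mutating webhook or the platform (e.g. Karpenter, GKE Autopilot)
	// tweaks the pod after admission, so the live pod never matches the
	// template again no matter how many times it is recreated.
	livePod := crTemplate
	livePod.CPULimit = "2500m"  // normalized by the platform
	livePod.NodeName = "node-a" // assigned before the default scheduler runs

	// Every reconcile sees "drift", deletes the head pod, and the fresh
	// replacement gets mutated again: the termination loop in this issue.
	fmt.Println("recreate?", needsRecreate(livePod, crTemplate)) // always true
}
```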
So, we're going to need to reconsider the design of --forced-cluster-upgrade.
It does not work to compare a pod's configuration against the template in the RayCluster CR.
Rather, we need to compare applied configuration against the previously applied configuration.
One approach is this:
- When creating a pod, compute a hash of the applied configuration.
- Store the hash in the pod's metadata.labels.
- To determine if configuration is outdated, compute the hash from the pod template in the RayCluster CR and check if it matches metadata.labels.
(This is essentially the approach used by the Ray autoscaler to manage virtual machines.)
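As a rough illustration of that approach, here is a hedged Go sketch; the PodTemplate struct, the hashTemplate helper, and the ray.io/pod-template-hash label key are invented for this example and are not KubeRay's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PodTemplate stands in for the pod template embedded in the RayCluster CR.
type PodTemplate struct {
	Image     string            `json:"image"`
	Resources map[string]string `json:"resources"`
}

// hashTemplate computes a stable hash of the template as it was applied.
// (encoding/json sorts map keys, so the output is deterministic here.)
func hashTemplate(t PodTemplate) string {
	b, _ := json.Marshal(t)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	applied := PodTemplate{
		Image:     "rayproject/ray:2.0.0",
		Resources: map[string]string{"cpu": "2"},
	}

	// At pod-creation time, record the hash of the applied template on the
	// pod's metadata.labels (the label key here is made up for the example).
	labels := map[string]string{"ray.io/pod-template-hash": hashTemplate(applied)}

	// Later, recompute the hash from the current CR template. The pod is
	// outdated only if the CR itself changed; in-cluster mutations of the
	// live pod object no longer matter.
	current := applied // CR unchanged -> hashes match -> keep the pod
	fmt.Println("outdated?", labels["ray.io/pod-template-hash"] != hashTemplate(current))

	current.Image = "rayproject/ray:2.1.0" // user actually updates the CR
	fmt.Println("outdated?", labels["ray.io/pod-template-hash"] != hashTemplate(current))
}
```

The key property is that the stored hash captures what the operator applied, so post-admission mutations by webhooks or the platform no longer register as drift.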
I think this will ultimately be resolved by supporting rolling upgrades.
We will need to expedite the design and implementation of rolling upgrade support (#527).
Btw, Dima, can you give me more context on --forced-cluster-upgrade? @DmitriGekhtman
cc @scarlet25151 @Jeffwan
Not much more context to share besides that it would be nice to be able to update pod configuration by default! This would give us parity with Ray-on-VM deployment and with the K8s built-ins.
Some more information for myself and others: #231 is the original code. We need to change the design.
I can confirm we are using Karpenter which has a mutating webhook.
Thanks for the quick response!
@alex-treebeard could you provide more detail about the situation? For example, is there an extra container injected, or is the head pod template changed by the webhook when a pod is created?
With the current implementation, I believe either of these fairly common situations would lead to the behavior observed here.
Hi Alex,
I guess the webhook does one of the following things:
- creates an extra container in the pod
- modifies resources
- changes the container image
Can you confirm this? @alex-treebeard
I think it assigns a node to the pod before the kube scheduler has a chance to.
It doesn't add a container or change the image.
If we give you a debug build, are you able to help us diagnose this issue?
Thanks @wilsonwang371, I should be able to in the next couple of weeks (I'm a bit away from Karpenter right now).
@alex-treebeard Hi Alex, do you have any update on this?
The current logic is obviously unacceptable -- for this functionality to work, you need to compare (a hash of) the current pod configuration in the RayCluster CR against (a hash of) the last applied configuration. Comparing the actual pod object against the RayCluster CR is not going to work. cc @kevin85421 @sihanwang41 @architkulkarni Let's track the work over here: https://github.com/ray-project/kuberay/issues/527
@scarlet25151 Shall we prioritize this task? This needs to be improved soon.