
[Bug] --forced-cluster-upgrade causes a termination loop for the Ray head node

Open alex-treebeard opened this issue 3 years ago • 17 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

  1. When I enable --forced-cluster-upgrade on the Ray operator (v0.3), it continuously terminates the Ray head node (which it then replaces with an apparently identical pod).
  2. I expect this flag to terminate my cluster no more than once, when first enabled.

Operator logs:
2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
2022-09-13T13:39:37.525Z	INFO	controllers.RayCluster	updateStatus	{"service port's name is empty. Not adding it to RayCluster status.endpoints": {"protocol":"TCP","port":6379,"targetPort":6379}}
2022-09-13T13:39:40.744Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "deployment-dev-eks-kuberay"}
2022-09-13T13:39:40.744Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "deployment-dev-eks-kuberay-head-svc"}

For some reason, my RayCluster has the following workersToDelete list, which feels relevant:

  workerGroupSpecs:
    - groupName: workergroup
      maxReplicas: 75
      minReplicas: 0
      rayStartParams:
        block: 'true'
        node-ip-address: $MY_POD_IP
        redis-password: LetMeInRay
      replicas: 0
      scaleStrategy:
        workersToDelete:
          - ty-deployment-prod-eks-kuberay-worker-workergroup-5qg4z
          - ty-deployment-prod-eks-kuberay-worker-workergroup-6bhv4
          - ty-deployment-prod-eks-kuberay-worker-workergroup-75tj6
          - ty-deployment-prod-eks-kuberay-worker-workergroup-bwb99
          - ty-deployment-prod-eks-kuberay-worker-workergroup-d6mmb
          - ty-deployment-prod-eks-kuberay-worker-workergroup-d8hjt
          - ty-deployment-prod-eks-kuberay-worker-workergroup-gmvn5
          - ty-deployment-prod-eks-kuberay-worker-workergroup-gv8z8
          - ty-deployment-prod-eks-kuberay-worker-workergroup-hx989
          - ty-deployment-prod-eks-kuberay-worker-workergroup-jh26v
          - ty-deployment-prod-eks-kuberay-worker-workergroup-kb9xv
          - ty-deployment-prod-eks-kuberay-worker-workergroup-lhgpf
          - ty-deployment-prod-eks-kuberay-worker-workergroup-mbb75
          - ty-deployment-prod-eks-kuberay-worker-workergroup-nlvtq
          - ty-deployment-prod-eks-kuberay-worker-workergroup-pp4lk
          - ty-deployment-prod-eks-kuberay-worker-workergroup-q6gkt
          - ty-deployment-prod-eks-kuberay-worker-workergroup-qcj9p
          - ty-deployment-prod-eks-kuberay-worker-workergroup-qmb54
          - ty-deployment-prod-eks-kuberay-worker-workergroup-vrf7c
          - ty-deployment-prod-eks-kuberay-worker-workergroup-xt5bb
          - ty-deployment-prod-eks-kuberay-worker-workergroup-xwl97
          - ty-deployment-prod-eks-kuberay-worker-workergroup-zxcb2

Reproduction script

  1. Deploy KubeRay 0.3.
  2. Create a RayCluster.
  3. Add args to the KubeRay operator deployment:
          args: [ '--forced-cluster-upgrade' ]

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

alex-treebeard avatar Sep 13 '22 13:09 alex-treebeard

Could you share your Ray cluster's config? And, just confirming, is it correct that you didn't update your Ray cluster's configuration (and this happened anyway)?

One theory is that something in your K8s environment is mutating the pod's configuration away from what's configured in the RayCluster CR, and the operator is perceiving that as a configuration change that needs to be reconciled.

cc @wilsonwang371

DmitriGekhtman avatar Sep 13 '22 18:09 DmitriGekhtman

^ After taking a look at the code, I'm almost certain that this is what's happening.

There are quite a few circumstances where the K8s environment may mutate a pod's resource requests or limits -- for example, GKE autopilot may do this to suit its internal bin-packing.

With the current implementation of --forced-cluster-upgrade, this would lead to the sort of churning behavior observed in this issue.
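
To make the failure mode concrete, here is a minimal Go sketch of the kind of naive "is this pod outdated?" check that produces the loop. The package, function names, and the exact fields compared are illustrative assumptions, not KubeRay's actual code:

    package sketch

    import (
        "reflect"

        corev1 "k8s.io/api/core/v1"
    )

    // podMatchesTemplate compares the live pod against the pod template in
    // the RayCluster CR. It returns false whenever anything differs --
    // including differences introduced by an admission webhook (Karpenter,
    // GKE Autopilot, sidecar injectors, ...) after the operator created the pod.
    func podMatchesTemplate(live *corev1.Pod, tmpl *corev1.PodTemplateSpec) bool {
        return reflect.DeepEqual(live.Spec.Containers, tmpl.Spec.Containers)
    }

Under --forced-cluster-upgrade semantics, a false result triggers delete-and-recreate; the replacement pod is mutated by the same webhook, the check fails again, and the loop never terminates.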

DmitriGekhtman avatar Sep 13 '22 18:09 DmitriGekhtman

So, we're going to need to reconsider the design of --forced-cluster-upgrade.

It does not work to compare a pod's configuration against the template in the RayCluster CR.

Rather, we need to compare applied configuration against the previously applied configuration.

One approach is this:

  • When creating a pod, compute a hash of the applied configuration.
  • Store the hash in the pod's metadata.labels.
  • To determine if configuration is outdated, compute the hash from the pod template in the RayCluster CR and check if it matches metadata.labels.

(This is essentially the approach used by the Ray autoscaler to manage virtual machines.)
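
As a rough Go sketch of that approach (the label key and helper names are assumptions for illustration, not KubeRay's actual identifiers):

    package sketch

    import (
        "crypto/sha256"
        "encoding/hex"
        "encoding/json"

        corev1 "k8s.io/api/core/v1"
    )

    // templateHashLabel is an illustrative label key.
    const templateHashLabel = "example.io/pod-template-hash"

    // hashPodTemplate hashes the *applied* configuration, i.e. the pod
    // template from the RayCluster CR -- not the live, possibly
    // webhook-mutated pod object. Note that Kubernetes label values are
    // capped at 63 characters, so a real implementation would truncate
    // or otherwise shorten the hex digest.
    func hashPodTemplate(tmpl corev1.PodTemplateSpec) (string, error) {
        b, err := json.Marshal(tmpl.Spec)
        if err != nil {
            return "", err
        }
        sum := sha256.Sum256(b)
        return hex.EncodeToString(sum[:]), nil
    }

    // stampHash records the hash on the pod at creation time.
    func stampHash(pod *corev1.Pod, hash string) {
        if pod.Labels == nil {
            pod.Labels = map[string]string{}
        }
        pod.Labels[templateHashLabel] = hash
    }

    // isOutdated compares the hash of the current CR template against the
    // hash recorded at creation. Webhook mutations of the live pod never
    // touch the stored label, so they no longer trigger churn.
    func isOutdated(pod *corev1.Pod, currentHash string) bool {
        return pod.Labels[templateHashLabel] != currentHash
    }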

DmitriGekhtman avatar Sep 13 '22 18:09 DmitriGekhtman

I think this will finally be resolved by supporting rolling upgrades.

We will need to expedite the design and implementation of rolling upgrade support (#527).

Btw, Dima, can you give me more context on --forced-cluster-upgrade? @DmitriGekhtman

wilsonwang371 avatar Sep 14 '22 02:09 wilsonwang371

cc @scarlet25151 @Jeffwan

wilsonwang371 avatar Sep 14 '22 02:09 wilsonwang371

Btw, Dima, can you give me more context on --forced-cluster-upgrade? @DmitriGekhtman

Not much more context to share besides that it would be nice to be able to update pod configuration by default! This would give us parity with Ray-on-VM deployment and with the K8s built-ins.

DmitriGekhtman avatar Sep 14 '22 02:09 DmitriGekhtman

Some more information for myself and others: #231 is the original code. We need to change the design.

wilsonwang371 avatar Sep 14 '22 04:09 wilsonwang371

I can confirm we are using Karpenter, which has a mutating webhook.

Thanks for the quick response!

alex-treebeard avatar Sep 14 '22 09:09 alex-treebeard

@alex-treebeard could you provide more detail about the situation? For example, is there an extra container injected, or any change made to the head pod template by the webhook when a pod is created?

scarlet25151 avatar Sep 26 '22 18:09 scarlet25151

@alex-treebeard could you provide more detail about the situation? For example, is there an extra container injected, or any change made to the head pod template by the webhook when a pod is created?

With the current implementation, I believe either of these fairly common situations would lead to the behavior observed here.

DmitriGekhtman avatar Sep 27 '22 00:09 DmitriGekhtman

I can confirm we are using Karpenter, which has a mutating webhook.

Thanks for the quick response!

Hi Alex,

I guess the webhook does one of the following things:

  1. creates an extra container in the pod
  2. modifies resources
  3. changes the container image

Can you confirm this? @alex-treebeard

wilsonwang371 avatar Sep 29 '22 21:09 wilsonwang371

I think it assigns a node to the pod before the kube scheduler has a chance to.

It doesn't add a container or change the image.

alex-treebeard avatar Sep 29 '22 23:09 alex-treebeard

I think it assigns a node to the pod before the kube scheduler has a chance to.

It doesn't add a container or change the image.

If we give you a debug build, would you be able to help us diagnose this issue?

wilsonwang371 avatar Sep 30 '22 18:09 wilsonwang371

Thanks @wilsonwang371, I should be able to in the next couple of weeks (I'm a bit away from Karpenter right now).

alex-treebeard avatar Oct 04 '22 08:10 alex-treebeard

@alex-treebeard Hi Alex, do you have any update on this?

wilsonwang371 avatar Dec 14 '22 05:12 wilsonwang371

The current logic is obviously unacceptable -- for this functionality to work, you need to compare (a hash of) the current pod configuration in the RayCluster CR against (a hash of) the last applied configuration. Comparing the actual pod object against the RayCluster CR is not going to work. cc @kevin85421 @sihanwang41 @architkulkarni Let's track the work over here: https://github.com/ray-project/kuberay/issues/527

DmitriGekhtman avatar Dec 14 '22 16:12 DmitriGekhtman

The current logic is obviously unacceptable -- for this functionality to work, you need to compare (a hash of) the current pod configuration in the RayCluster CR against (a hash of) the last applied configuration. Comparing the actual pod object against the RayCluster CR is not going to work. cc @kevin85421 @sihanwang41 @architkulkarni Let's track the work over here: #527

@scarlet25151 Shall we prioritize this task? This needs to be improved soon.

wilsonwang371 avatar Dec 14 '22 20:12 wilsonwang371