[Bug] --forced-cluster-upgrade Causes termination loop for ray head node
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
- When I enable --forced-cluster-upgrade on the Ray operator (v0.3), it continuously terminates the Ray head node (which it then replaces with an apparently identical pod). I expect this flag to terminate my cluster no more than once, when it is first enabled.
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"head pod found": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "workergroup"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "workergroup"}
2022-09-13T13:39:37.525Z INFO controllers.RayCluster updateStatus {"service port's name is empty. Not adding it to RayCluster status.endpoints": {"protocol":"TCP","port":6379,"targetPort":6379}}
2022-09-13T13:39:40.744Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "deployment-dev-eks-kuberay"}
2022-09-13T13:39:40.744Z INFO controllers.RayCluster reconcileServices {"headService service found": "deployment-dev-eks-kuberay-head-svc"}
For some reason my RayCluster has this list of workersToDelete in its scaleStrategy, which seems relevant:
workerGroupSpecs:
- groupName: workergroup
  maxReplicas: 75
  minReplicas: 0
  rayStartParams:
    block: 'true'
    node-ip-address: $MY_POD_IP
    redis-password: LetMeInRay
  replicas: 0
  scaleStrategy:
    workersToDelete:
    - ty-deployment-prod-eks-kuberay-worker-workergroup-5qg4z
    - ty-deployment-prod-eks-kuberay-worker-workergroup-6bhv4
    - ty-deployment-prod-eks-kuberay-worker-workergroup-75tj6
    - ty-deployment-prod-eks-kuberay-worker-workergroup-bwb99
    - ty-deployment-prod-eks-kuberay-worker-workergroup-d6mmb
    - ty-deployment-prod-eks-kuberay-worker-workergroup-d8hjt
    - ty-deployment-prod-eks-kuberay-worker-workergroup-gmvn5
    - ty-deployment-prod-eks-kuberay-worker-workergroup-gv8z8
    - ty-deployment-prod-eks-kuberay-worker-workergroup-hx989
    - ty-deployment-prod-eks-kuberay-worker-workergroup-jh26v
    - ty-deployment-prod-eks-kuberay-worker-workergroup-kb9xv
    - ty-deployment-prod-eks-kuberay-worker-workergroup-lhgpf
    - ty-deployment-prod-eks-kuberay-worker-workergroup-mbb75
    - ty-deployment-prod-eks-kuberay-worker-workergroup-nlvtq
    - ty-deployment-prod-eks-kuberay-worker-workergroup-pp4lk
    - ty-deployment-prod-eks-kuberay-worker-workergroup-q6gkt
    - ty-deployment-prod-eks-kuberay-worker-workergroup-qcj9p
    - ty-deployment-prod-eks-kuberay-worker-workergroup-qmb54
    - ty-deployment-prod-eks-kuberay-worker-workergroup-vrf7c
    - ty-deployment-prod-eks-kuberay-worker-workergroup-xt5bb
    - ty-deployment-prod-eks-kuberay-worker-workergroup-xwl97
    - ty-deployment-prod-eks-kuberay-worker-workergroup-zxcb2
Reproduction script
- Deploy kuberay 0.3
- Create a raycluster
- Add args to the kuberay operator deployment:
  args: [ '--forced-cluster-upgrade' ]
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Could you share your Ray cluster's config? And, just confirming, is it correct that you didn't update your Ray cluster's configuration and this happened anyway?
One theory is that something in your K8s environment is mutating the pod's configuration away from what's configured in the RayCluster CR and the operator is perceiving that as a configuration change that needs to be reconciled.
cc @wilsonwang371
^ After taking a look at the code, I'm almost certain that this is what's happening.
There are quite a few circumstances where the K8s environment may mutate a pod's resource requests or limits -- for example, GKE autopilot may do this to suit its internal bin-packing.
With the current implementation of --forced-cluster-upgrade, this would lead to the sort of churning behavior observed in this issue.
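To make the churn concrete, here is a minimal Go sketch of the failure mode. This is not the actual KubeRay reconciler code; the PodSpec struct, its fields, and the needsRecreate function are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"reflect"
)

// PodSpec is a stand-in for the fields the operator compares; it is not the
// real Kubernetes PodSpec type.
type PodSpec struct {
	Image    string
	CPULimit string
	NodeName string
}

// needsRecreate models a naive "compare live pod against CR template" check.
func needsRecreate(livePod, crTemplate PodSpec) bool {
	return !reflect.DeepEqual(livePod, crTemplate)
}

func main() {
	crTemplate := PodSpec{Image: "rayproject/ray:2.0.0", CPULimit: "2"}

	// A mutating webhook or the platform (e.g. Karpenter, GKE Autopilot)
	// tweaks the pod after admission, so the live pod never matches the
	// template again no matter how many times it is recreated.
	livePod := crTemplate
	livePod.CPULimit = "2500m"  // normalized by the platform
	livePod.NodeName = "node-a" // assigned before the default scheduler runs

	// Every reconcile sees "drift", deletes the head pod, and the fresh
	// replacement gets mutated again: the termination loop in this issue.
	fmt.Println("recreate?", needsRecreate(livePod, crTemplate)) // always true
}
```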
So, we're going to need to reconsider the design of --forced-cluster-upgrade.
It does not work to compare a pod's configuration against the template in the RayCluster CR.
Rather, we need to compare applied configuration against the previously applied configuration.
One approach is this:
- When creating a pod, compute a hash of the applied configuration.
- Store the hash in the pod's metadata.labels.
- To determine if configuration is outdated, compute the hash from the pod template in the RayCluster CR and check if it matches metadata.labels.
(This is essentially the approach used by the Ray autoscaler to manage virtual machines.)
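As a rough illustration of that approach, here is a hedged Go sketch; the PodTemplate struct, the hashTemplate helper, and the ray.io/pod-template-hash label key are invented for this example and are not KubeRay's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PodTemplate stands in for the pod template embedded in the RayCluster CR.
type PodTemplate struct {
	Image     string            `json:"image"`
	Resources map[string]string `json:"resources"`
}

// hashTemplate computes a stable hash of the template as it was applied.
// (encoding/json sorts map keys, so the output is deterministic here.)
func hashTemplate(t PodTemplate) string {
	b, _ := json.Marshal(t)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	applied := PodTemplate{
		Image:     "rayproject/ray:2.0.0",
		Resources: map[string]string{"cpu": "2"},
	}

	// At pod-creation time, record the hash of the applied template on the
	// pod's metadata.labels (the label key here is made up for the example).
	labels := map[string]string{"ray.io/pod-template-hash": hashTemplate(applied)}

	// Later, recompute the hash from the current CR template. The pod is
	// outdated only if the CR itself changed; in-cluster mutations of the
	// live pod object no longer matter.
	current := applied // CR unchanged -> hashes match -> keep the pod
	fmt.Println("outdated?", labels["ray.io/pod-template-hash"] != hashTemplate(current))

	current.Image = "rayproject/ray:2.1.0" // user actually updates the CR
	fmt.Println("outdated?", labels["ray.io/pod-template-hash"] != hashTemplate(current))
}
```

The key property is that the stored hash captures what the operator applied, so post-admission mutations by webhooks or the platform no longer register as drift.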
I think this will ultimately be resolved by supporting rolling upgrades.
We will need to expedite the design and implementation of rolling upgrade support (#527).
Btw, Dima, can you give me more context on --forced-cluster-upgrade? @DmitriGekhtman
cc @scarlet25151 @Jeffwan
Not much more context to share besides that it would be nice to be able to update pod configuration by default! This would give us parity with Ray-on-VM deployment and with the K8s built-ins.
Some more information for myself and others: #231 is the original code. We need to change the design.
I can confirm we are using Karpenter which has a mutating webhook.
Thanks for the quick response!
@alex-treebeard could you provide more detail about the situation? For example, is there an extra container injected, or is the head pod template changed by the webhook when a pod is created?
With the current implementation, I believe either of these fairly common situations would lead to the behavior observed here.
Hi Alex,
I guess the webhook does one of the following things:
- creates an extra container in the pod
- modifies resources
- changes the container image
Can you confirm this? @alex-treebeard
I think it assigns a node to the pod before the kube scheduler has a chance to.
It doesn't add a container or change the image.
If we give you a debug build, are you able to help us diagnose this issue?
Thanks @wilsonwang371, I should be able to in the next couple of weeks (I'm a bit away from Karpenter right now).
@alex-treebeard Hi Alex, do you have any update on this?
The current logic is obviously unacceptable -- for this functionality to work, you need to compare (a hash of) the current pod configuration in the RayCluster CR against (a hash of) the last applied configuration. Comparing the actual pod object against the RayCluster CR is not going to work. cc @kevin85421 @sihanwang41 @architkulkarni Let's track the work over here: https://github.com/ray-project/kuberay/issues/527
@scarlet25151 Shall we prioritize this task? This needs to be improved soon.