Kuberay operator Deployment strategy type should be Recreate

Open haoxins opened this issue 2 years ago • 2 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

We should change the KubeRay operator Deployment's strategy.type from the default (RollingUpdate) to Recreate to avoid potential problems when a new operator pod is deployed during an upgrade.

Allowing only one operator pod to run at any given time is a better way to avoid inconsistencies, and it does not require leader election.

see also: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy

BTW, we should also remove replicaCount: 1 from values.yaml and hard-code the replica count as 1 in deployment.yaml.
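
For illustration, the proposed change might look roughly like the manifest below. This is only a sketch: the resource name, labels, and image are placeholders, not values taken from the KubeRay chart. The `replicas` and `strategy.type` fields are the standard Deployment fields from the Kubernetes documentation linked above.

```yaml
# Sketch of an operator Deployment with the proposed settings.
# The name, labels, and image are illustrative placeholders (assumptions).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuberay-operator
spec:
  replicas: 1          # hard-coded instead of templated from values.yaml
  strategy:
    type: Recreate     # old pod is terminated before the new pod is created
  selector:
    matchLabels:
      app: kuberay-operator
  template:
    metadata:
      labels:
        app: kuberay-operator
    spec:
      containers:
        - name: operator
          image: example.com/kuberay-operator:placeholder
```

With Recreate, an upgrade briefly leaves zero operator pods running, which is the trade-off for never having two at once.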

Reproduction script

No need

Anything else

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

haoxins avatar Sep 10 '22 14:09 haoxins

@akanso @Jeffwan do you have thoughts on the deployment strategy for the operator? Recreate does sound safer, possibly at the expense of a bit more downtime.

Removing the replica count field sounds like a good idea until we make leader election possible: https://github.com/ray-project/kuberay/issues/474

Another option is to expose both strategy and replica fields, default them to Recreate and 1 respectively, and give a warning not to mess with the fields unless you know what you're doing.
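
A values.yaml fragment for that option might look something like the sketch below. Only `replicaCount` appears in this thread; the `strategy` key and its layout are assumptions made for illustration, and the chart's deployment.yaml template would need to read them.

```yaml
# Hypothetical values.yaml fragment; the `strategy` key name is an assumption.
#
# WARNING: do not change these unless you know what you are doing.
# Running more than one operator pod at a time is unsafe without leader election.
replicaCount: 1
strategy:
  type: Recreate
```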

DmitriGekhtman avatar Sep 14 '22 17:09 DmitriGekhtman

I agree. Since the operator affects the K8s cluster state, we do not want two active operator pods running at the same time.

Since the operator's startup time is negligible, we can afford a Recreate policy, or a RollingUpdate with maxSurge = 0.
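
As a rough sketch, the maxSurge alternative would be expressed like this in the Deployment spec. Kubernetes rejects a rolling update where both `maxSurge` and `maxUnavailable` are 0, so `maxUnavailable: 1` is set explicitly here; that value is an assumption, not something stated in this thread.

```yaml
# Fragment of a Deployment spec: rolling update that never creates an extra pod.
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never exceed the desired replica count
      maxUnavailable: 1  # must be non-zero when maxSurge is 0
```

One caveat worth checking against the Kubernetes documentation: under RollingUpdate a pod that is still terminating may overlap briefly with its replacement, whereas Recreate is documented to wait for old pods to terminate during an upgrade.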

akanso avatar Sep 14 '22 18:09 akanso