[Feature] Clean up init container configuration and startup sequence.
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
At the moment, we instruct users to include an init container in each worker group spec. The purpose of the init container is to wait for the service exposing the head GCS server to be created before the worker attempts `ray start`.
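For reference, the init container that worker group specs currently carry looks roughly like the sketch below (paraphrased from the sample YAMLs; the exact image and command vary). Note that it only waits for the head service to exist, not for the GCS to be ready:

```yaml
initContainers:
  # Paraphrased sketch of the init container users are asked to add today.
  # <head-service> stands for the DNS name of the head node's Kubernetes service.
  - name: init-myservice
    image: busybox:1.28
    command:
      - sh
      - -c
      - "until nslookup <head-service>; do echo waiting for the head service; sleep 2; done"
```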
There are two issues with the current setup:
- Having to include the init container makes the minimal configuration for a RayCluster messier. If an init container is necessary, it would be better to have the KubeRay operator create it by default.
- The current logic is not quite correct, for the following reason: after the initContainer determines that the head service is ready, the Ray worker container immediately runs `ray start`, whether or not the GCS is ready. `ray start` has internal retry logic that eventually gives up if the head pod does not start quickly enough -- the worker container will then crash-loop. (This is not that bad, given the typical time scales for provisioning Ray pods and `ray start`'s internal timeout.)
The tasks are to simplify configuration and correct the logic.
Two ways to correct the logic:
1. Implement an initContainer that waits for the GCS to be ready.
2. Drop the initContainer and just have the Ray container's entry-point wait as long as necessary.

The advantage of option 2 is that it's simpler.
The advantage of option 1 is that it's perhaps more idiomatic and gives more feedback to a user who is examining worker pod status with `kubectl get pod` -- the user can distinguish "Initializing" and "Running" states for the worker container.
If we stick with an initContainer (option 1), we can either
- Have the operator configure it automatically OR
- Leave it alone, leave it to Helm to hide that configuration, and invest in Helm as the preferred means of deploying.
Use case
Interface cleanup.
Related issues
This falls under the generic category of "interface cleanup", for which we have this issue: https://github.com/ray-project/kuberay/issues/368
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
I think shielding the user from the init-container logic has its advantages.
We can have the operator add the init container to the pod. (We need to make sure here that we append it to the list of init containers that the user may have already defined for other purposes.)
Since Helm is not the only deployment method, I don't think we should add the init container there; some users of KubeRay might miss this logic completely if they do not use Helm.
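To illustrate the append behavior (everything below is a placeholder sketch, not what the operator would literally emit), a worker pod that already declares its own init container would end up with something like:

```yaml
initContainers:
  # Init container the user already defined for other purposes -- left untouched.
  - name: user-defined-setup
    image: busybox:1.28
    command: ["sh", "-c", "echo preparing something"]
  # Init container appended (not substituted) by the KubeRay operator.
  # Name and contents are placeholders; how it actually waits for the GCS
  # is the open question discussed below.
  - name: wait-for-head
    image: <ray-image>
    command: ["<wait until the head is reachable>"]
```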
One question here: how does the init container "wait for the GCS to be ready"?
All Ray versions since ~1.4.0 have a CLI command, `ray health-check`, that can health-check the GCS.
One idea is to use the Ray image for the init container and loop on `ray health-check --address <head-service>:<port>` with a five-second backoff, tolerating any error.
There's no overhead from pulling the Ray image, since you need it anyway to run Ray.
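A rough sketch of what such an init container could look like, with `<head-service>`, `<port>`, and the image left as placeholders:

```yaml
initContainers:
  # Sketch only: reuse the Ray image (it has to be pulled for the worker anyway)
  # and block until the head's GCS answers a health check, retrying every 5s
  # and tolerating any error in the meantime.
  - name: wait-gcs-ready
    image: <same Ray image as the worker container>
    command:
      - sh
      - -c
      - |
        until ray health-check --address <head-service>:<port> > /dev/null 2>&1; do
          echo "waiting for GCS at <head-service>:<port>"
          sleep 5
        done
```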
It's not urgent to fix in the next release -- it works well enough to copy-paste the extra configuration.
Some recent discussion: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959 https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669991115558659
Just a basic ping to the GCS server should do the trick.
Hi @DmitriGekhtman,
Would you mind explaining some details about this issue? I have read this issue and the related Slack discussions. In my understanding, the current solution is to:
- Make the KubeRay operator configure the `initContainer` for users.
- Use `ray health-check` to wait until the GCS is ready.
- Remove the `initContainer` from all sample YAML files. For example, https://github.com/ray-project/kuberay/blob/633ff6375099b2737db4c74a51c31028d2e54bc3/ray-operator/config/samples/ray-cluster.autoscaler.yaml#L125-L129
Is it correct? Thank you!
I'd recommend:
First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions but might have been necessary for successful startup with very old Ray versions, for which we do not guarantee compatibility.
Next, investigate whether we need to change anything in the worker startup sequence.
One way or another, Ray workers need to wait for the head to start. Currently this is accomplished by retrying the GCS connection in the `ray start` Python code. If there's a timeout, the container crash-loops.
One option is to modify the Ray code to make the number of retries adjustable via an environment variable, and then have the operator set that variable. This would prevent crash-loops for new Ray versions.
I think the retry logic is here.
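For illustration only, if such a knob existed (the environment variable name below is hypothetical, not an existing Ray setting), the operator could inject it into the worker's Ray container along these lines:

```yaml
containers:
  - name: ray-worker
    image: <ray-image>
    env:
      # Hypothetical variable name -- the real one would be whatever the
      # Ray-side change exposes. The operator would set it to a generous
      # value so that `ray start` keeps retrying instead of crash-looping.
      - name: RAY_START_GCS_CONNECT_RETRIES
        value: "240"
```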
> First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions but might have been necessary for successful startup with very old Ray versions, for which we do not guarantee compatibility.

What's the difference between "recent Ray versions" and "very old Ray versions" mentioned above? Thank you!
Uh, good question. I actually don't know - @akanso mentioned that the init containers were necessary to get things to work for older Ray versions, roughly two years ago.