[Feature] Clean up init container configuration and startup sequence.
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
At the moment, we instruct users to include an init container in each worker group spec. The purpose of the init container is to wait for the service exposing the head GCS server to be created before the worker attempts `ray start`.
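For reference, the init container that worker group specs currently carry looks roughly like the sketch below (paraphrased from the sample YAMLs; the exact image and command vary). Note that it only waits for the head service to exist, not for the GCS to be ready:

```yaml
initContainers:
  # Paraphrased sketch of the init container users are asked to add today.
  # <head-service> stands for the DNS name of the head node's Kubernetes service.
  - name: init-myservice
    image: busybox:1.28
    command:
      - sh
      - -c
      - "until nslookup <head-service>; do echo waiting for the head service; sleep 2; done"
```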
There are two issues with the current setup:
- Having to include the init container makes the minimal configuration for a RayCluster messier. If an init container is necessary, it would be better to have the KubeRay operator create it by default.
- The current logic is not quite correct, for the following reason: after the initContainer determines that the head service is ready, the Ray worker container immediately runs `ray start`, whether or not the GCS is ready. `ray start` has internal retry logic that eventually gives up if the head pod does not start quickly enough -- the worker container will then crash-loop. (This is not that bad, given the typical time scales for provisioning Ray pods and `ray start`'s internal timeout.)
The tasks are to simplify configuration and correct the logic.
Two ways to correct the logic:
1. Implement an initContainer that waits for the GCS to be ready.
2. Drop the initContainer and just have the Ray container's entry-point wait as long as necessary.

The advantage of option 2 is that it's simpler.
The advantage of option 1 is that it's perhaps more idiomatic and gives more feedback to a user who is examining worker pod status with `kubectl get pod` -- the user can distinguish "Initializing" and "Running" states for the worker container.
If we stick with an initContainer (option 1), we can either
- Have the operator configure it automatically OR
- Leave it alone, leave it to Helm to hide that configuration, and invest in Helm as the preferred means of deploying.
Use case
Interface cleanup.
Related issues
This falls under the generic category of "interface cleanup", for which we have this issue: https://github.com/ray-project/kuberay/issues/368
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
I think shielding the user from the init-container logic has its advantages.
We can have the operator add the init container to the pod. (We need to make sure here that we append it to the list of init containers that the user may have already defined for other purposes.)
Since Helm is not the only deployment method, I don't think we should add the init container there; some users of KubeRay might miss this logic completely if they do not use Helm.
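To illustrate the append behavior (everything below is a placeholder sketch, not what the operator would literally emit), a worker pod that already declares its own init container would end up with something like:

```yaml
initContainers:
  # Init container the user already defined for other purposes -- left untouched.
  - name: user-defined-setup
    image: busybox:1.28
    command: ["sh", "-c", "echo preparing something"]
  # Init container appended (not substituted) by the KubeRay operator.
  # Name and contents are placeholders; how it actually waits for the GCS
  # is the open question discussed below.
  - name: wait-for-head
    image: <ray-image>
    command: ["<wait until the head is reachable>"]
```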
One question here: how does the init container "wait for the GCS to be ready"?
All Ray versions since ~1.4.0 have a CLI command, `ray health-check`, that can health-check the GCS.
One idea is to use the Ray image for the init container and loop on `ray health-check --address <head-service>:<port>` with a five-second backoff, tolerating any error.
There's no overhead from pulling the Ray image, since you need it anyway to run Ray.
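A rough sketch of what such an init container could look like, with `<head-service>`, `<port>`, and the image left as placeholders:

```yaml
initContainers:
  # Sketch only: reuse the Ray image (it has to be pulled for the worker anyway)
  # and block until the head's GCS answers a health check, retrying every 5s
  # and tolerating any error in the meantime.
  - name: wait-gcs-ready
    image: <same Ray image as the worker container>
    command:
      - sh
      - -c
      - |
        until ray health-check --address <head-service>:<port> > /dev/null 2>&1; do
          echo "waiting for GCS at <head-service>:<port>"
          sleep 5
        done
```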
It's not urgent to fix in the next release -- it works well enough to copy-paste the extra configuration.
Some recent discussion: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669647595429959 https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1669991115558659
Just a basic ping to the GCS server should do the trick.
Hi @DmitriGekhtman,
Would you mind explaining some details about this issue? I have read this issue and the related Slack discussions. In my understanding, the current solution is to:
- Make the KubeRay operator configure the `initContainer` for users.
- Use `ray health-check` to wait until the GCS is ready.
- Remove the `initContainer` from all sample YAML files. For example, https://github.com/ray-project/kuberay/blob/633ff6375099b2737db4c74a51c31028d2e54bc3/ray-operator/config/samples/ray-cluster.autoscaler.yaml#L125-L129
Is it correct? Thank you!
I'd recommend:
First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions but might have been necessary for successful startup with very old Ray versions, for which we do not guarantee compatibility.
Next, investigate whether we need to change anything in the worker startup sequence.
One way or another, Ray workers need to wait for the head to start. Currently this is accomplished by retrying the GCS connection in the `ray start` Python code. If there's a timeout, the container crash-loops.
One option is to modify the Ray code to make the number of retries adjustable via an environment variable, and then have the operator set that variable. This would prevent crash-loops for new Ray versions.
I think the retry logic is here.
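For illustration only, if such a knob existed (the environment variable name below is hypothetical, not an existing Ray setting), the operator could inject it into the worker's Ray container along these lines:

```yaml
containers:
  - name: ray-worker
    image: <ray-image>
    env:
      # Hypothetical variable name -- the real one would be whatever the
      # Ray-side change exposes. The operator would set it to a generous
      # value so that `ray start` keeps retrying instead of crash-looping.
      - name: RAY_START_GCS_CONNECT_RETRIES
        value: "240"
```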
> First, remove initContainers from all sample configs. They accomplish nothing for recent Ray versions but might have been necessary for successful startup with very old Ray versions, for which we do not guarantee compatibility.

What's the difference between "recent Ray versions" and "very old Ray versions" mentioned above? Thank you!
Uh, good question. I actually don't know - @akanso mentioned that the init containers were necessary to get things to work for older Ray versions, roughly two years ago.