
[Bug] `wait-gcs-ready` init-container going out-of-memory indefinitely (OOMKilled)

Open bluenote10 opened this issue 1 year ago • 9 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

We are unable to use Ray on Kubernetes, because our workers keep crashing with out-of-memory errors in the wait-gcs-ready init-container. This results in an infinite backoff loop that keeps re-running the init-container; it never seems to succeed, and therefore no workers ever become available.

A kubectl describe ourclustername-cpu-group-worker-2sbdj for instance reveals:

Init Containers:
  wait-gcs-ready:
    [...]
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 13 Jan 2025 12:17:25 +0100
      Finished:     Mon, 13 Jan 2025 12:18:07 +0100
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi

Note that the upper memory limit of 256 Mi is rather low, and seems to be coming from here:

https://github.com/ray-project/kuberay/blob/9068102246eeb5ab9d9e0b9a7480618d3f348686/ray-operator/controllers/ray/common/pod.go#L222
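For a rough illustration (a hand-written sketch, not the actual KubeRay source; the real defaults live in the linked pod.go), a hard-coded 200m/256Mi request and limit expressed with the standard Kubernetes API types would look roughly like this:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// defaultWaitGCSReadyResources mirrors the values observed in the
// kubectl describe output above: 200m CPU and 256Mi memory for both
// requests and limits.
func defaultWaitGCSReadyResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("200m"),
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("200m"),
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
	}
}

func main() {
	res := defaultWaitGCSReadyResources()
	fmt.Println(res.Limits.Memory().String()) // prints: 256Mi
}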

Our assumption is that the pod runs out of memory on this line of the script, which invokes the ray CLI:

https://github.com/ray-project/kuberay/blob/9068102246eeb5ab9d9e0b9a7480618d3f348686/ray-operator/controllers/ray/common/pod.go#L192
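To make it concrete how the init container ends up running the Ray CLI, here is an assumed sketch (the command string, helper name, and address below are illustrative, not copied from pod.go) of how such a readiness-wait command could be built; every invocation of ray health-check starts a full Python process, which matches the resident set sizes measured below:

package main

import "fmt"

// waitGCSReadyCommand is a hypothetical helper: it builds a shell loop that
// polls GCS by invoking `ray health-check` until it succeeds. Each iteration
// spawns a new Python process, so the init container's memory limit has to
// cover that process's peak usage.
func waitGCSReadyCommand(gcsAddress string) []string {
	script := fmt.Sprintf(
		"until ray health-check --address %s > /dev/null 2>&1; do echo waiting for GCS; sleep 5; done",
		gcsAddress,
	)
	return []string{"/bin/bash", "-c", script}
}

func main() {
	fmt.Println(waitGCSReadyCommand("raycluster-head-svc:6379"))
}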

To get a rough estimate of the memory usage of that call, one can check with e.g.:

/usr/bin/time -l ray health-check --address localhost:1234 2>&1 | grep "resident set size"

which reveals a resident set size of around 180 to 190 MB. Accounting for additional memory usage from the system, 256Mi may simply not be enough.

Reproduction script

It doesn't really matter, because this is a Kubernetes configuration problem.

But we are basically submitting a simple hello world for testing:

import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.cluster_resources())
print(ray.get(hello_world.remote()))

Anything else

How often does the problem occur?

Since the exact amount of allocated memory is non-deterministic, the error also happens non-deterministically for us. Depending on the environment, it seems to fail with different probabilities:

  • on our production cluster it is fortunately close to 0%.
  • on our CI kind cluster it fails ~90% of the time.
  • on some developer machines it fails ~100% of the time.

We do not yet understand why the different environments have such different failure rates.

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

bluenote10 avatar Jan 13 '25 12:01 bluenote10

Thank you, @bluenote10. I think your PR works, but we can probably do better by copying the resource requests and limits from the Ray container. Would you like to explore this idea?

rueian avatar Jan 13 '25 17:01 rueian

cc @kevin85421

rueian avatar Jan 13 '25 17:01 rueian

we can probably do better by copying the resource requests and limits from the Ray container

I was wondering about that as well, but concluded that the memory requirements of the wait-gcs-ready init container are quite different from those of the regular container, right? Essentially, when the main container itself is set to a large multi-GB value, the init container would unnecessarily request much more memory than it really needs. It seems to make sense to decouple the requirements of the init container from those of the main container, if I understand it correctly.

bluenote10 avatar Jan 13 '25 20:01 bluenote10

Essentially, when the main container itself is set to a large multi-GB value, the init container would unnecessarily request much more memory than it really needs.

That's correct. The wait-gcs-ready init container is definitely lighter than the actual ray container. But as far as I know, according to https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resource-sharing-within-containers, it is safe to copy resource requests and limits from the Ray container to init containers because it won't change the effective requests/limits.
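A minimal sketch of that idea (assumed helper name and signature, not KubeRay's actual implementation): copy the Ray container's requests and limits onto the injected init container, only where the user has not set anything explicitly. Because a pod's effective requests/limits are the maximum of the sum over app containers and any single init container, this copy cannot increase what the scheduler reserves for the pod:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// copyRayContainerResources is a hypothetical helper: it mirrors the main Ray
// container's resource requests and limits onto an injected init container,
// leaving any explicitly configured values untouched.
func copyRayContainerResources(rayContainer corev1.Container, initContainer *corev1.Container) {
	if initContainer.Resources.Requests == nil {
		initContainer.Resources.Requests = rayContainer.Resources.Requests.DeepCopy()
	}
	if initContainer.Resources.Limits == nil {
		initContainer.Resources.Limits = rayContainer.Resources.Limits.DeepCopy()
	}
}

func main() {
	ray := corev1.Container{
		Name: "ray-worker",
		Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				corev1.ResourceMemory: resource.MustParse("4Gi"),
			},
		},
	}
	init := corev1.Container{Name: "wait-gcs-ready"}
	copyRayContainerResources(ray, &init)
	fmt.Println(init.Resources.Limits.Memory().String()) // prints: 4Gi
}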

rueian avatar Jan 13 '25 20:01 rueian

Hmm, health-check just sends a gRPC request to GCS IIRC. If it uses that many resources, it should be a bug in Ray.

kevin85421 avatar Jan 13 '25 23:01 kevin85421

@bluenote10 which Ray version do you use and what's your K8s env (e.g. EKS? GKE? K8s version?)?

kevin85421 avatar Jan 15 '25 18:01 kevin85421

@kevin85421 We are experiencing this mainly using kind on local developer machines and CI runners. The Kubernetes version is 1.30.6.

bluenote10 avatar Jan 20 '25 10:01 bluenote10

@kevin85421 Replicated using Minikube as well

yoschihirokaimoto avatar Jul 01 '25 00:07 yoschihirokaimoto

This problem can easily happen when Ray is included in a large Bazel-built container image. In that case, the wait-gcs-ready init container can easily hit the 256Mi memory limit and get OOMKilled.

I would prefer to introduce an InitContainerOptions field, analogous to AutoscalerOptions, to offer the ability to set the memory limit, and I am preparing a PR to address this.
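Purely as a sketch of what that could look like (all names and fields below are assumptions, not the API from the upcoming PR), an InitContainerOptions block analogous to AutoscalerOptions might expose the init container resources on the RayCluster spec:

package apisketch

import (
	corev1 "k8s.io/api/core/v1"
)

// InitContainerOptions is a hypothetical CRD field that would let users
// override settings of operator-injected init containers such as
// wait-gcs-ready.
type InitContainerOptions struct {
	// Resources overrides the default requests/limits
	// (e.g. the 256Mi memory limit discussed above).
	Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
}

// rayClusterSpecFragment only illustrates where such a field might live;
// the real RayClusterSpec has many more fields.
type rayClusterSpecFragment struct {
	InitContainerOptions *InitContainerOptions `json:"initContainerOptions,omitempty"`
}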

Myasuka avatar Nov 24 '25 16:11 Myasuka