[Bug] Investigate slow Ray pod termination

Open DmitriGekhtman opened this issue 3 years ago • 11 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I've noticed that Ray pods often take a long time to terminate (minutes) after deleting a RayCluster CR. We should investigate why that is the case.

Reproduction script

Tear down a Ray cluster by deleting a RayCluster CR and observe the Ray pods' state, e.g. with watch -n 1 kubectl get pod. It might take a few minutes for the pods to terminate. There's no reason for a Ray pod to take so long to process SIGTERM and exit cleanly.
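
If you want to script the reproduction, here's a rough sketch using the Python Kubernetes client (the CRD version, namespace, cluster name, and the ray.io/cluster pod label are assumptions that may need adjusting for your KubeRay version):

```python
# Sketch: delete a RayCluster CR and time how long its pods take to disappear.
# Assumes the `kubernetes` Python client, a reachable cluster, and that KubeRay
# labels Ray pods with ray.io/cluster=<cluster name>.
import time
from kubernetes import client, config

config.load_kube_config()
namespace, cluster = "default", "raycluster-complete"  # placeholders

client.CustomObjectsApi().delete_namespaced_custom_object(
    group="ray.io", version="v1alpha1", namespace=namespace,
    plural="rayclusters", name=cluster,
)

core = client.CoreV1Api()
start = time.time()
while True:
    pods = core.list_namespaced_pod(
        namespace, label_selector=f"ray.io/cluster={cluster}"
    ).items
    if not pods:
        break
    print(f"{time.time() - start:.0f}s: {[p.metadata.name for p in pods]}")
    time.sleep(5)
print(f"all Ray pods gone after {time.time() - start:.0f}s")
```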

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

DmitriGekhtman avatar Aug 24 '22 01:08 DmitriGekhtman

I don't observe this issue at the moment; it takes 30 seconds to terminate both the Ray head and worker Pods.

kevin85421 avatar Jan 30 '24 17:01 kevin85421

I guess 30 seconds is better than the minutes claimed in the issue description. But why does it take a full 30 seconds to do the termination? That sounds like the default K8s grace period, which suggests that a SIGKILL to PID 1 is required to stop a Ray pod.

DmitriGekhtman avatar Jan 30 '24 17:01 DmitriGekhtman

Probably the process running ray start --block does not clean up its child processes when it receives a SIGTERM. There's at least no SIGTERM handling that I can see in the code. I vaguely recall complaining about a similar issue to @rickyyx.
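
For illustration, a minimal sketch of the kind of SIGTERM handling the entrypoint could do (this is not Ray's actual code; the wrapper script and the use of ray stop on shutdown are assumptions):

```python
# Hypothetical entrypoint wrapper: run `ray start --block`, and on SIGTERM
# attempt a clean `ray stop` before exiting instead of leaving child processes
# to be reaped by the Kubelet's eventual SIGKILL.
import signal
import subprocess
import sys

proc = subprocess.Popen(["ray", "start", "--head", "--block"])

def handle_sigterm(signum, frame):
    subprocess.run(["ray", "stop"])  # tear down raylet, GCS, workers, etc.
    proc.terminate()
    proc.wait()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
sys.exit(proc.wait())
```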

DmitriGekhtman avatar Jan 30 '24 18:01 DmitriGekhtman

But why does it take a full 30 seconds to do the termination? That sounds like the default K8s grace period,

It makes sense to me. kubectl has a default grace period of 30 seconds (doc), but I am not sure whether client-go has the same behavior.
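
For what it's worth, the 30 seconds ultimately comes from the pod spec's terminationGracePeriodSeconds field (default 30); a delete only uses a different value if the client passes one explicitly. A sketch with the Python client (pod name and namespace are placeholders):

```python
# Sketch: delete a Ray pod with an explicit grace period instead of relying on
# the pod spec's default terminationGracePeriodSeconds (30s).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# grace_period_seconds overrides the pod's terminationGracePeriodSeconds;
# 0 skips straight to SIGKILL (useful only for testing).
core.delete_namespaced_pod(
    name="raycluster-complete-head-xxxxx",  # placeholder pod name
    namespace="default",
    grace_period_seconds=5,
)
```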

Probably the process running ray start --block does not clean up its child processes when it receives a SIGTERM.

I think so.

kevin85421 avatar Jan 30 '24 18:01 kevin85421

I believe Ray treats SIGTERM as an expected exit code for ray start --block; I guess it's always the SIGKILL that's actually killing the Ray pod.

rickyyx avatar Jan 30 '24 19:01 rickyyx

I believe Ray treats SIGTERM as an expected exit code for ray start --block; I guess it's always the SIGKILL that's actually killing the Ray pod.

Is there any handling of a SIGTERM to the "ray start --block" process itself? This is what we need for correct termination on K8s.

DmitriGekhtman avatar Jan 30 '24 19:01 DmitriGekhtman

I see, so KubeRay sends a SIGTERM to the entrypoint process itself?

rickyyx avatar Jan 30 '24 21:01 rickyyx

I see, so KubeRay sends a SIGTERM to the entrypoint process itself?

More or less.

Technically, Kubernetes (more specifically, the Kubelet) sends the SIGTERM when the KubeRay operator (or any other agent) marks the pod for deletion. The Kubelet then waits for a configurable grace period before sending SIGKILL. I am personally a little hazy on how Kubernetes handles the non-entrypoint processes; it might depend on the choice of container runtime.
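
To make that sequence concrete, here's a rough local approximation of what the Kubelet does to the container's PID 1 (a sketch; assumes ray is installed locally and uses the default 30-second grace period):

```python
# Simulate the Kubelet's termination sequence against a local `ray start --block`:
# SIGTERM to the entrypoint, wait up to the grace period, then SIGKILL.
import signal
import subprocess
import time

proc = subprocess.Popen(["ray", "start", "--head", "--block"])
time.sleep(15)  # give Ray time to come up

proc.send_signal(signal.SIGTERM)   # what the Kubelet sends first
try:
    proc.wait(timeout=30)          # default terminationGracePeriodSeconds
    print("entrypoint exited on SIGTERM with code", proc.returncode)
except subprocess.TimeoutExpired:
    proc.kill()                    # SIGKILL after the grace period
    print("entrypoint ignored SIGTERM; SIGKILL sent")
```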

DmitriGekhtman avatar Jan 30 '24 21:01 DmitriGekhtman

I see. I did a local test: I think sending the SIGTERM to the entrypoint process (ray start --block) does exit the process. But if the SIGTERM is sent to other Ray processes like the raylet, that does not exit the entrypoint process.

rickyyx avatar Jan 30 '24 22:01 rickyyx

I think sending the SIGTERM to the entrypoint process (ray start --block) does exit the process

It definitely would exit the Python interpreter! But I bet Ray would continue to run after you do that.

DmitriGekhtman avatar Jan 30 '24 22:01 DmitriGekhtman

I think sending the SIGTERM to the entrypoint process (ray start --block) does exit the process

It definitely would exit the Python interpreter! But I bet Ray would continue to run after you do that.

It wasn't running in my case. Worth validating with a KubeRay pod.
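
A rough way to validate that locally (a sketch; assumes ray and psutil are installed, and that the interesting backend processes are named raylet and gcs_server):

```python
# Send SIGTERM to the `ray start --block` entrypoint, then check whether Ray's
# backend processes (raylet, GCS) are still alive afterwards.
import signal
import subprocess
import time

import psutil

entrypoint = subprocess.Popen(["ray", "start", "--head", "--block"])
time.sleep(15)  # give Ray time to come up

entrypoint.send_signal(signal.SIGTERM)
try:
    entrypoint.wait(timeout=30)
    print("entrypoint exited with code", entrypoint.returncode)
except subprocess.TimeoutExpired:
    print("entrypoint did not exit within 30s")

# Any raylet/gcs_server processes that outlived the entrypoint?
survivors = [
    p.info["name"]
    for p in psutil.process_iter(["name"])
    if p.info["name"] in ("raylet", "gcs_server")
]
print("surviving Ray processes:", survivors or "none")
```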

rickyyx avatar Jan 30 '24 22:01 rickyyx