[Feature][RayService] Handle Serve deployment deletion during cluster teardown.
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
During cluster rotation, the RayService controller should delete the Serve deployments before tearing down the whole Ray cluster. This allows in-flight requests in the old Ray cluster to finish before it is destroyed.
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Seems reasonable to include in the release.
Not able to make the change before the 0.4.0 branch cut, so removing the milestone.
@sihanwang41 will you take this issue? If not, this issue can help @architkulkarni familiarize himself with KubeRay.
Are there any updates regarding this feature?
Hi @kevin85421.
We are considering using KubeRay and Ray Serve for our production model servers, and we want async behavior: we plan to use FastAPI background tasks to run heavy workloads after a response has been returned. Running Ray Serve locally, I can confirm that the application tries to wait for all background tasks to finish before shutting down on SIGTERM.
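For concreteness, a minimal sketch of the pattern we have in mind (the deployment class and `heavy_workload` are illustrative names, not our real code):

```python
from fastapi import BackgroundTasks, FastAPI
from ray import serve

app = FastAPI()

def heavy_workload(payload: dict) -> None:
    # Long-running work that continues after the response has been sent.
    ...

@serve.deployment
@serve.ingress(app)
class Ingress:
    @app.post("/predict")
    async def predict(self, payload: dict, background_tasks: BackgroundTasks):
        # Schedule the heavy work and return immediately; the client can
        # poll for the result later.
        background_tasks.add_task(heavy_workload, payload)
        return {"status": "accepted"}
```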
However, with KubeRay there is no such waiting during rolling upgrades and cluster rotation: the old cluster is terminated immediately even if a background task is still running.
I think this feature is the missing piece. May I ask whether there is a schedule for implementing it, or whether there is a potential workaround? Thank you in advance.
I'm willing to try providing a PR for the fix as well, but I'll need some help getting started on how and where to fix it.
This allows in-flight requests in the old Ray cluster to finish before it is destroyed.
I am not sure whether this is correct or not.
We plan to use FastAPI background tasks to run heavy workloads after a response has been returned.
Do you mean: the user sends a request → a Ray Serve replica triggers a heavy workload → it returns a response without waiting for the heavy workload to finish?
Do you mean: the user sends a request → a Ray Serve replica triggers a heavy workload → it returns a response without waiting for the heavy workload to finish?
Yes
I set up a long-running endpoint (sleeping for 5 minutes) and can see that the request gets hung up during cluster rotation.
It seems that regular requests are not drained either. I set graceful_shutdown_timeout_s to 3000.
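For reference, this is roughly how the timeout was configured (a sketch; the class name is illustrative):

```python
from ray import serve

@serve.deployment(
    # Give each replica up to 3000 s to finish in-flight requests when it
    # is being gracefully stopped.
    graceful_shutdown_timeout_s=3000,
)
class SlowEndpoint:
    async def __call__(self, request):
        ...
```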
After taking a look at the code, it seems the current logic deletes the old cluster after a 60-second wait (ref).
One possible fix I can think of: before adding the deletion timestamp, send a Serve delete API request. Then, before actually deleting the cluster, call the Serve list API to make sure there is no running application. The Serve delete API will handle draining the requests. A rough sketch of the flow is below.
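To make the idea concrete, here is the flow sketched in Python against the Serve REST API (the real change would live in the Go controller; the dashboard address is a placeholder):

```python
import time

import requests

# Placeholder address for the old cluster's head node dashboard.
DASHBOARD = "http://old-cluster-head-svc:8265"

# 1. Ask Serve to shut down all applications; Serve drains in-flight
#    requests before tearing down the replica actors.
requests.delete(f"{DASHBOARD}/api/serve/applications/").raise_for_status()

# 2. Poll the list endpoint until no applications remain; only then is it
#    safe to delete the old RayCluster.
while requests.get(f"{DASHBOARD}/api/serve/applications/").json().get("applications"):
    time.sleep(5)
```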
I set up a long-running endpoint (sleeping for 5 minutes) and can see that the request gets hung up during cluster rotation. It seems that regular requests are not drained either. I set graceful_shutdown_timeout_s to 3000.
RayService is designed for online services. In most cases, each request takes only 10 milliseconds to several hundred milliseconds. Hence, KubeRay deletes the old RayCluster 60 seconds after the traffic switches to the new cluster.
Are the heavy workloads separate Ray jobs? If so, I doubt that Ray Serve will drain them. Why not use RayCluster and submit jobs to the Ray dashboard?
Are the heavy workloads separate Ray jobs?
No, we will just launch a FastAPI background task that calls other Ray Serve deployments after a response has been returned to the user. Users can later poll for the result.
For reference, this RFC describes what we are trying to do: https://github.com/ray-project/ray/issues/32292
I doubt that Ray Serve will drain them.
The Serve shutdown API will try to drain the requests before deleting all the actors. There is a timeout which can be configured (ref).
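For context, the in-cluster Python equivalent is a one-liner (a minimal sketch):

```python
from ray import serve

# Tears down all Serve applications; each replica gets up to its
# graceful_shutdown_timeout_s to finish in-flight requests before its
# actor is deleted.
serve.shutdown()
```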
@kevin85421 May I ask for your opinion regarding the serve shutdown feature for KubeRay? If it's reasonable, I can help with creating the PR for the feature.
The feature makes sense to me. I can accept the feature if it doesn't add too much complexity to the codebase and it is disabled by default.
If it is too complex, you can try increasing RayClusterDeletionDelaySeconds (https://github.com/ray-project/kuberay/pull/3864) as a workaround.
Implementing the feature is not going to be easy because, during old cluster rotation, the head node service has already been rotated. There are two ways I can think of (see the sketch after this list):
- Create a temporary service pointing to the old cluster's head node, send the Serve shutdown request, and wait for the shutdown to complete.
- Use the old cluster's head pod IP instead.
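A rough sketch of the second option, again in Python for illustration (the pod IP is a placeholder; in practice it would be read from the old RayCluster's head pod status):

```python
import requests

# Hypothetical head pod IP taken from the old RayCluster's head pod
# status; this bypasses the head node service, which has already been
# rotated to the new cluster.
head_pod_ip = "10.0.0.42"

# Send the Serve shutdown request directly to the old head pod's
# dashboard port, then wait for draining to complete as above.
requests.delete(f"http://{head_pod_ip}:8265/api/serve/applications/").raise_for_status()
```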