
[Feature]: Expose a read-only API to check whether engine is sleeping

Open · waltforme opened this issue 8 months ago

🚀 The feature, motivation and pitch

The sleep feature works great!

We can do even better by adding a read-only probe that reports whether the engine is sleeping. Today the /sleep and /wake_up endpoints are POSTs to the API server that mutate the engine's state. We could introduce an /is_sleeping endpoint that GETs the sleep status.

There are at least two use cases for this feature.

  1. We orchestrate and optimize across multiple vLLM instances, some asleep and some awake. We need a global view of every instance's sleep status. The proposed probe lets us compose this view statelessly, i.e. without having to remember whether each and every engine has been put to sleep.
  2. Today a sleeping engine crashes if a request is sent to it. This probe gives a well-behaved client peace of mind before sending a request.
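
The first use case could be sketched roughly like this, assuming a hypothetical `GET /is_sleeping` endpoint that returns `{"is_sleeping": true/false}` (the endpoint name and response shape are assumptions for illustration, not an agreed API):

```python
# Sketch of a stateless orchestrator view over several vLLM instances.
# The /is_sleeping endpoint and its JSON shape are assumed, not final.
import json
from urllib.request import urlopen


def probe(base_url: str) -> bool:
    """Ask one instance whether it is sleeping (read-only, stateless)."""
    with urlopen(f"{base_url}/is_sleeping") as resp:
        return json.load(resp)["is_sleeping"]


def awake_instances(statuses: dict[str, bool]) -> list[str]:
    """Compose the global view: which instances can accept requests now."""
    return [url for url, sleeping in statuses.items() if not sleeping]


# The orchestrator never has to remember past /sleep or /wake_up calls:
# statuses = {url: probe(url) for url in instance_urls}
# ready = awake_instances(statuses)
```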

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

waltforme avatar Mar 05 '25 21:03 waltforme

You should be able to manage the is_sleeping state at a higher level: after you POST a /sleep call, you can mark that engine as sleeping.

I don't think we need to explicitly expose this API endpoint.
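
For what it's worth, the upper-level tracking suggested here could look like the sketch below (the `EngineTracker` class and injectable `post` callable are made up for illustration; only the /sleep and /wake_up paths come from the existing API):

```python
# Minimal sketch of client-side sleep-state tracking, as an alternative
# to a server-side /is_sleeping probe. Note the caveat: this view drifts
# if an engine restarts or changes state outside this tracker, which is
# part of the motivation for a read-only probe.
class EngineTracker:
    def __init__(self, post=lambda url: None):
        self._post = post              # injectable HTTP POST, e.g. requests.post
        self._sleeping: set[str] = set()

    def sleep(self, base_url: str) -> None:
        self._post(f"{base_url}/sleep")
        self._sleeping.add(base_url)   # remember the state we just set

    def wake_up(self, base_url: str) -> None:
        self._post(f"{base_url}/wake_up")
        self._sleeping.discard(base_url)

    def is_sleeping(self, base_url: str) -> bool:
        return base_url in self._sleeping
```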

youkaichao avatar Mar 06 '25 05:03 youkaichao

https://kubernetes.io/docs/reference/using-api/health-checks/

FWIW, it feels like this falls under liveness probes, and we might as well make the server Kubernetes-compatible.

aarnphm avatar Mar 15 '25 11:03 aarnphm

@aarnphm Thanks! I responded in the PR.

waltforme avatar Mar 16 '25 06:03 waltforme

Hi, I have been following and testing the sleep feature, and I think it can help with a requirement of mine that is similar to your use cases.

In my tests, I also hit the issue where the engine crashes when a request is sent to a sleeping engine. (I would expect the request to be rejected with a response saying the engine is sleeping.)

I think the /is_sleeping endpoint is good, but it could be refined further into an /engine_status endpoint that reports (1) the engine status (sleeping, running, etc.) and (2) which model the engine is serving, giving orchestrators more information.
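
One possible shape for such a response, purely illustrative (the endpoint, field names, and status values below are suggestions, not an existing API):

```python
# Illustrative /engine_status response builder. Everything here is a
# proposal sketch; nothing matches a merged vLLM endpoint.
def engine_status(is_sleeping: bool, model_name: str) -> dict:
    return {
        "status": "sleeping" if is_sleeping else "running",
        "model": model_name,
    }
```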

I would like to hear your views on this. @waltforme @youkaichao

lizzzcai avatar Mar 19 '25 05:03 lizzzcai

Regarding the crash when a request is sent to a sleeping engine, I completely agree. @njhill also talked about this here https://github.com/vllm-project/vllm/pull/14312#issuecomment-2721588578. I think the plumbing introduced by #14312 can be conveniently reused to implement this small feature. I gave it a try in my own fork a couple of days ago, and it seems to be working.

I can continue and finish this piece if we all like it.

waltforme avatar Mar 20 '25 03:03 waltforme

Hi @waltforme, I think returning an error when the engine is sleeping would be more helpful than crashing. I can help create an issue if needed. Thanks.

lizzzcai avatar Mar 22 '25 05:03 lizzzcai