vllm
[Feature]: Expose a read-only API to check whether engine is sleeping
🚀 The feature, motivation and pitch
The sleep feature works great!
We can do even better by adding a read-only probe to check whether the engine is sleeping. Today, the /sleep and /wake_up endpoints are POSTs to the API server that mutate the engine's state. We could introduce an /is_sleeping endpoint that GETs the sleep status.
There are at least two use cases asking for this feature.
- We orchestrate and optimize across multiple vLLM instances, some asleep and some awake, and need a global view of every instance's sleep status. The proposed probe lets us compose that view without being stateful, i.e. without remembering whether each and every engine has been put to sleep.
- Today, a sleeping engine crashes if a request is sent to it. This probe gives a good citizen peace of mind before sending a request.
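A minimal sketch of how an orchestrator could use the proposed probe, assuming a GET /is_sleeping endpoint that returns JSON like `{"is_sleeping": true}` — the endpoint name comes from this issue, but the response shape and helper names are my guesses:

```python
import json
import urllib.request

def parse_sleep_response(body: bytes) -> bool:
    """Extract the sleep flag from an /is_sleeping JSON body."""
    return bool(json.loads(body)["is_sleeping"])

def is_sleeping(base_url: str) -> bool:
    """Ask one vLLM instance whether its engine is sleeping."""
    with urllib.request.urlopen(f"{base_url}/is_sleeping") as resp:
        return parse_sleep_response(resp.read())

def sleeping_map(base_urls: list[str]) -> dict[str, bool]:
    """Compose a global view of sleep status, statelessly."""
    return {url: is_sleeping(url) for url in base_urls}
```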
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
You should be able to manage the is_sleeping state from an upper level: after you POST a /sleep call, you can mark that engine as sleeping.
I don't think we need to explicitly expose this API endpoint.
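The suggestion above can be sketched as a small orchestrator-side tracker; the `SleepTracker` class and its method names are hypothetical, with only the /sleep and /wake_up endpoints taken from vLLM:

```python
import urllib.request

class SleepTracker:
    """Track sleep state externally, updating it alongside each
    /sleep or /wake_up POST instead of asking the engine."""

    def __init__(self, base_urls, post=None):
        # Assume every engine starts awake; `post` is injectable
        # for testing and defaults to a real HTTP POST.
        self.sleeping = {url: False for url in base_urls}
        self._post = post if post is not None else self._http_post

    def sleep(self, url):
        self._post(url, "/sleep")
        self.sleeping[url] = True   # mark it ourselves

    def wake_up(self, url):
        self._post(url, "/wake_up")
        self.sleeping[url] = False

    @staticmethod
    def _http_post(url, path):
        req = urllib.request.Request(f"{url}{path}", method="POST")
        urllib.request.urlopen(req).close()
```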
https://kubernetes.io/docs/reference/using-api/health-checks/
FWIW, this feels like it falls under liveness probes, and we might as well make the server k8s-compatible.
@aarnphm Thanks! I responded in the PR.
Hi, I am trying out the sleep feature, and I think it can help solve a requirement of mine that is similar to your use cases.
When I tested, I also hit the issue that the engine crashes when a request is sent to a sleeping engine. (I would expect the request to be rejected with a response saying the engine is sleeping.)
I think the /is_sleeping endpoint is good, but it could be refined further into an /engine_status endpoint that reports 1. the engine status (sleeping, running, etc.), and 2. which model the engine is running, to give more information for orchestration.
I would like to hear your views on this. @waltforme @youkaichao
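To make the /engine_status idea concrete, here is one possible response shape; the field names and values are suggestions for discussion, not an agreed schema, and the model name is just an example:

```python
# Illustrative /engine_status response body (as a Python dict).
engine_status = {
    "status": "sleeping",                      # e.g. "sleeping" | "running"
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "sleep_level": 1,                          # vLLM sleep mode has levels 1 and 2
}
```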
Regarding the sleeping engine crashing when sent a request, I completely agree. @njhill also talked about this here https://github.com/vllm-project/vllm/pull/14312#issuecomment-2721588578. I think the plumbing introduced by #14312 can be conveniently utilized to implement this little feature. So I gave it a try in my own fork here a couple of days ago, and it seems to be working.
I can continue to finish this piece if we all like it.
Hi @waltforme, I think returning an error when the engine is sleeping would be more helpful than crashing. I can help create an issue if needed. Thanks.