
[Core] Expose API endpoint `/is_sleeping`

Open waltforme opened this issue 8 months ago • 1 comments

This PR exposes a read-only API endpoint to check whether the engine is sleeping. More details are documented in #14311.

FIX #14311
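
A minimal client-side sketch of how such a probe could be consumed. It assumes the endpoint answers a GET on `/is_sleeping` with a JSON object carrying a boolean `is_sleeping` field; that response shape and the helper names (`parse_is_sleeping`, `is_sleeping`) are illustrative assumptions, not something stated in this thread.

```python
import json
import urllib.request


def parse_is_sleeping(body: bytes) -> bool:
    """Extract the sleeping flag from a response body.

    Assumption: the body is a JSON object such as {"is_sleeping": true}.
    """
    payload = json.loads(body)
    value = payload["is_sleeping"]
    if not isinstance(value, bool):
        raise ValueError(f"expected a boolean, got {value!r}")
    return value


def is_sleeping(base_url: str, timeout: float = 5.0) -> bool:
    """Send GET {base_url}/is_sleeping and return the parsed flag."""
    url = base_url.rstrip("/") + "/is_sleeping"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_is_sleeping(resp.read())
```

Since the call is read-only, it can be repeated freely without changing engine state, for example `is_sleeping("http://localhost:8000")` against a locally running server (the host and port here are placeholders).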

waltforme avatar Mar 05 '25 21:03 waltforme

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Mar 05 '25 21:03 github-actions[bot]

The changes themselves look fine to me. I'm just unsure how commonly needed this might be (same as @youkaichao's thought), especially if we ensure that the sleep/wakeup operations are idempotent (not sure if that's currently the case, but it should be trivial to fix otherwise).

> 1. Today a sleeping engine crashes if a request is sent to it. This probe will give a good citizen peace of mind before sending a request.

Could we make a change to just fail the requests in this case rather than crashing the engine? That could then also serve as the probe mechanism if needed.
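
To make the two ideas in this comment concrete, here is a toy sketch of idempotent sleep/wake-up combined with rejecting requests while asleep. `ToyEngine` and `EngineSleepingError` are hypothetical names used only for illustration; this is not the vLLM engine or its actual behavior.

```python
class EngineSleepingError(RuntimeError):
    """Raised for a request that reaches a sleeping engine, instead of crashing."""


class ToyEngine:
    """Hypothetical stand-in for an engine with a sleep mode."""

    def __init__(self) -> None:
        self._sleeping = False
        self.transitions = 0  # counts real state changes only

    def sleep(self) -> None:
        if self._sleeping:
            return  # idempotent: sleeping twice is a no-op
        self._sleeping = True
        self.transitions += 1

    def wake_up(self) -> None:
        if not self._sleeping:
            return  # idempotent: waking an awake engine is a no-op
        self._sleeping = False
        self.transitions += 1

    def is_sleeping(self) -> bool:
        return self._sleeping

    def generate(self, prompt: str) -> str:
        # Fail this one request; the engine itself stays alive.
        if self._sleeping:
            raise EngineSleepingError("engine is sleeping; wake it up first")
        return f"echo: {prompt}"
```

A server built on this pattern could translate `EngineSleepingError` into an HTTP error response, so a caller gets a clear failure rather than a dead engine.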

njhill avatar Mar 13 '25 15:03 njhill

Thanks for the review, @njhill! I absolutely agree that the 'fail request when sleeping' feature you suggested is worth doing.

I think the probe currently implemented in the PR is necessary, even if the 'fail request when sleeping' feature is done.

Let's think from a user's perspective. The user could be a person who can't remember the sleeping status of a fleet of vLLM instances, or a k8s controller that just crashed/restarted and is trying to rebuild the global state. It sounds more natural to query an API endpoint directly than to send an inference request to each vLLM instance and then observe whether each request fails or succeeds.

Moreover, if the inference-request-as-a-probe is sent to an awake engine, that request will be served and will consume extra resources. So IMHO, using an API endpoint is not only more natural but also more efficient.
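
The controller scenario could look roughly like the sketch below. `rebuild_sleep_state` is a hypothetical helper, and `probe` stands for any callable that returns the sleeping flag for one instance (for example, a thin HTTP GET on `/is_sleeping`); instances that cannot be reached are recorded as `None` rather than guessed.

```python
from typing import Callable, Dict, Iterable, Optional


def rebuild_sleep_state(
    instances: Iterable[str],
    probe: Callable[[str], bool],
) -> Dict[str, Optional[bool]]:
    """Query every instance once and collect its sleeping status.

    True means sleeping, False means awake, None means the probe failed.
    """
    state: Dict[str, Optional[bool]] = {}
    for base_url in instances:
        try:
            state[base_url] = probe(base_url)
        except (OSError, ValueError):
            # Network errors and malformed replies leave the status unknown.
            state[base_url] = None
    return state
```

No inference request is issued, so awake engines do no extra work while the controller recovers its view of the fleet.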

waltforme avatar Mar 13 '25 17:03 waltforme

@waltforme actually could you add a test for this? Probably just adding something to https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_sleep.py should suffice.

njhill avatar Mar 15 '25 01:03 njhill

> @waltforme actually could you add a test for this? Probably just adding something to https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_sleep.py should suffice.

@njhill Absolutely. Added to the suggested file. Thanks for checking this!

waltforme avatar Mar 15 '25 10:03 waltforme

not sure if this is a standard elsewhere, but we can follow k8s health API endpoint for this fwiw. (i also responded in the ticket)

aarnphm avatar Mar 15 '25 11:03 aarnphm

> not sure if this is a standard elsewhere, but we can follow k8s health API endpoint for this fwiw. (i also responded in the ticket)

@aarnphm Thanks for the point! It looks to me, however, like the k8s API health endpoints expose things that are very specific to k8s. For example, I tried one of them:

$ kubectl get --raw='/readyz/poststarthook/generic-apiserver-start-informers'
ok

Could you elaborate on what we would want to follow for vLLM?

waltforme avatar Mar 16 '25 06:03 waltforme

https://kubernetes.io/docs/reference/using-api/health-checks/#individual-health-checks

This is probably also related to production stack, but what I have in mind:

  • /readyz can be used to determine whether the engine is sleeping or not.
  • /livez can be used to determine whether all workers are ready.
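
A rough sketch of that mapping, written as a pure function so it stays independent of any web framework. `health_response` is a hypothetical helper, and the choice of 503 for the failing case is an assumption that follows the usual convention of HTTP probes treating a 2xx status as healthy.

```python
from typing import Tuple


def health_response(
    path: str, is_sleeping: bool, workers_ready: bool
) -> Tuple[int, str]:
    """Map engine state to (status code, body) for k8s-style health paths."""
    if path == "/readyz":
        # A sleeping engine is up but should not receive traffic.
        return (503, "sleeping") if is_sleeping else (200, "ok")
    if path == "/livez":
        # Reports whether all workers are ready, as proposed above.
        return (200, "ok") if workers_ready else (503, "workers not ready")
    return (404, "not found")
```

With this shape, a Kubernetes readiness probe pointed at `/readyz` would take a sleeping instance out of rotation without restarting it, while `/livez` stays independent of the sleep state.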

aarnphm avatar Mar 16 '25 06:03 aarnphm