feat: add enforce_include_usage option
Currently, when streaming, the usage field is always null. This prevents enforcing per-user limits and is somewhat unexpected, since without streaming the usage is always returned.
This is useful when vLLM sits behind a router such as vllm-router or LiteLLM and serves many users, where usage information is important for detecting abuse, dividing costs, etc.
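For context, a minimal sketch of the client-side behavior today, using the OpenAI Python client against an OpenAI-compatible vLLM deployment (base URL, API key, and model are placeholders taken from the test plan below): without stream_options the streamed chunks carry no usage, and this option makes the server emit the final usage chunk regardless.

```python
from openai import OpenAI

# Placeholder endpoint and credentials; adjust to your deployment.
client = OpenAI(base_url="https://yourhost.example.com/llm/v1", api_key="token")

stream = client.completions.create(
    model="qwen3-30b-a3b",
    prompt="def write_hello():",
    max_tokens=100,
    stream=True,
    # Without the server-side option, usage is only emitted when the client
    # explicitly opts in via stream_options.
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.usage is not None:
        # The usage arrives in a final chunk with an empty choices list.
        print(chunk.usage)
```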
Essential Elements of an Effective PR Description Checklist
- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
Purpose
Always return the token usage, which can be used by various systems for:
- Detecting abuse by users
- Billing or dividing costs internally within a company
In addition, this aligns the behavior with the non-streaming mode, where the usage is always returned.
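To make the billing/abuse-detection use case concrete, here is a hypothetical sketch of a proxy-side accumulator; the user_id key and record_usage helper are invented for illustration, and only the shape of the usage object follows the streamed response shown in the test plan.

```python
from collections import defaultdict

# Hypothetical per-user accounting; the accumulator and user_id are
# illustration only, not part of vLLM.
tokens_per_user: dict[str, int] = defaultdict(int)

def record_usage(user_id: str, chunk: dict) -> None:
    """Add the usage from a streamed chunk to the user's running total."""
    usage = chunk.get("usage")
    if usage is None:
        # Before this option, streamed chunks carried no usage at all,
        # so a proxy had nothing to bill against.
        return
    tokens_per_user[user_id] += usage["total_tokens"]
```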
Test Plan
- Check out this commit and run vLLM
- Send the following request and note that the usage is returned in the last segment
curl --request POST \
--url https://yourhost.example.com/llm/v1/completions \
--header 'apikey: {{token}}' \
--header 'content-type: application/json' \
--data '{
"model": "qwen3-30b-a3b",
"max_tokens": 100,
"presence_penalty": 0,
"frequency_penalty": 0,
"temperature": 0.1,
"prompt": "def write_hello():",
"stream": true
}'
data: {"id":"cmpl-ef83ad46-8ca3-49dc-8371-790f281f60a1#8733163","object":"text_completion","created":1750142471,"model":"qwen3-30b-a3b","choices":[],"usage":{"prompt_tokens":4,"total_tokens":104,"completion_tokens":100}}
Test Result
(Optional) Documentation Update
Local testing blocked by https://github.com/vllm-project/vllm/issues/15985
@aarnphm Thanks for the review! Let me know if I should squash my commits or if any other changes are required!
@aarnphm Thank you! Is there a place where I could put some docs for this feature?
No need to; https://docs.vllm.ai/en/latest/cli/index.html documents the --help output, and you already include the help string for this option.
@max-wittig hello, I used the vllm serve --enable-force-include-usage parameter, but the client request still needs to include "stream_options": {"include": true} in the request body to return usage information. If the stream_options parameter is not included, it still cannot return the usage. Is this the normal behavior for the --enable-force-include-usage parameter?
@Wfd567 That is because vllm has not released a new version yet. This PR is not yet released: https://github.com/vllm-project/vllm/pull/20983