[Bug] Remove compulsory `include_usage` when `stream=true` in gateway
Pull Request Description
When stream=true, the OpenAI API does not require stream_options to be specified. For example, this request works:
```bash
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "messages": [{"role": "user", "content": "help me write a random generator in python"}]
  }'
```
However, the AIBrix gateway currently requires `stream_options={"include_usage": true}` whenever stream=true. This PR simply removes that check.
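For comparison, a request in the shape the gateway currently mandates might look like the following; the `$AIBRIX_GATEWAY` endpoint placeholder is hypothetical, and the body is the same as above plus the stream_options field:

```bash
curl http://$AIBRIX_GATEWAY/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "stream_options": {"include_usage": true},
    "messages": [{"role": "user", "content": "help me write a random generator in python"}]
  }'
```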
Note from @Jeffwan
Some features, like the heterogeneous feature, rely on the usage being reported. We probably need some docs changes on the feature page to state that include_usage is needed.
Related Issues
Resolves: #[Insert issue number(s)]
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
- [Bug]: Corrections to existing functionality
- [CI]: Changes to build process or CI pipeline
- [Docs]: Updates or additions to documentation
- [API]: Modifications to aibrix's API or interface
- [CLI]: Changes or additions to the Command Line Interface
- [Misc]: For changes not covered above (use sparingly)
Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
- [ ] PR title includes appropriate prefix(es)
- [ ] Changes are clearly explained in the PR description
- [ ] New and existing tests pass successfully
- [ ] Code adheres to project style and best practices
- [ ] Documentation updated to reflect changes (if applicable)
- [ ] Thorough testing completed, no regressions introduced
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.
/assign @varungup90
If the user has enabled rpm/tpm validation, then we need to have include_usage. Making include_usage optional will require a check on whether the user has enabled the rpm/tpm limit check.
For features that rely on usage statistics, can we ask users in the documentation to enable it explicitly? The heterogeneous feature needs it as well. By default, it should be clean.
@varungup90 @Jeffwan Let me know how you want me to add the checks and how to test them. I'm eager to contribute, but if it's too complicated, I can close this PR and you can open your own.
Another question. When include_usage is required, is it possible to send include_usage=true to the inference pods, but have the gateway post-process the response so it complies with include_usage=false if the request says so? Because what I'm seeing is that if AIBrix users use features that require include_usage (rpm/tpm validation and heterogeneous GPUs), the server is not exactly OpenAI-compatible.
@varungup90 could you give more suggestions on the tpm check? Let's get @gau-nernst onboard.
- I want to understand where the blocker is if we mandate including stream usage. For a client that does not want to consume the usage report, it is still OK to include it in the request.
- For implementation, there are two alternatives. The first is to add another header, like "routing-strategy", which I feel will make the input request bulky or complicated. The second option is to mandate the stream_options usage check only when the user has enabled rpm/tpm validation or request tracing (see the sketch after this list).
- If we decide to make include_usage optional, then the major changes will be in HandleResponseBody, to adjust the rpm/tpm check and the heterogeneous tracing feature. Given the current lack of e2e tests, the implementation needs to be done carefully.
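For the second option, here is a minimal sketch of the conditional validation. The request and flag types are hypothetical stand-ins; the real gateway structures differ:

```go
package main

import (
	"errors"
	"fmt"
)

// StreamOptions and ChatCompletionRequest mirror only the fields relevant
// here; they are illustrative, not AIBrix's actual types.
type StreamOptions struct {
	IncludeUsage bool `json:"include_usage"`
}

type ChatCompletionRequest struct {
	Stream        bool           `json:"stream"`
	StreamOptions *StreamOptions `json:"stream_options,omitempty"`
}

// validateStreamOptions mandates include_usage only when a feature that
// consumes usage statistics (rpm/tpm validation or request tracing) is on.
func validateStreamOptions(req *ChatCompletionRequest, rpmTpmEnabled, tracingEnabled bool) error {
	if !req.Stream || !(rpmTpmEnabled || tracingEnabled) {
		return nil // no usage consumer enabled: include_usage stays optional
	}
	if req.StreamOptions == nil || !req.StreamOptions.IncludeUsage {
		return errors.New("stream_options.include_usage is required when rpm/tpm validation or request tracing is enabled")
	}
	return nil
}

func main() {
	req := &ChatCompletionRequest{Stream: true}
	fmt.Println(validateStreamOptions(req, true, false))  // non-nil error: usage is needed
	fmt.Println(validateStreamOptions(req, false, false)) // <nil>: plain streaming is fine
}
```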
> I want to understand where the blocker is if we mandate including stream usage.
I think the biggest issue is that it's not 100% OpenAI-compatible. Client code that does not expect include_usage=true might not work: from what I understand, there will be an extra last chunk with empty choices and non-null usage, and if client code does not handle this, it may break. I actually discovered this issue when trying to use SGLang's sglang.bench_serving to benchmark AIBrix. Of course I could modify SGLang's code, but the issue for general client code remains, and sometimes it's not possible to modify client code.
From the OpenAI docs (https://platform.openai.com/docs/api-reference/chat/create):
> If set, an additional chunk will be streamed before the data: [DONE] message. The usage field on this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value.
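For illustration, the tail of a stream with include_usage=true looks roughly like this (IDs and token counts are made up):

```
data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}

data: {"id":"chatcmpl-...","choices":[],"usage":{"prompt_tokens":9,"completion_tokens":12,"total_tokens":21}}

data: [DONE]
```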
Perhaps another option is to always send include_usage=true to the inference pods (vLLM), but have the gateway skip the last usage-statistics chunk if the client did not request it?
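A minimal sketch of that idea, assuming the gateway always requests include_usage upstream. All names here are hypothetical, and a fuller version would also have to strip the null usage fields that include_usage adds to the intermediate chunks:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// chunk captures only the fields needed to recognize the usage-only chunk.
type chunk struct {
	Choices []json.RawMessage `json:"choices"`
	Usage   json.RawMessage   `json:"usage"`
}

// filterUsageChunk copies SSE lines from the upstream pod to the client,
// dropping the final usage-only chunk (empty choices, non-null usage) when
// the client did not ask for include_usage.
func filterUsageChunk(upstream io.Reader, client io.Writer, clientWantsUsage bool) error {
	scanner := bufio.NewScanner(upstream)
	for scanner.Scan() {
		line := scanner.Text()
		if !clientWantsUsage && strings.HasPrefix(line, "data: ") && line != "data: [DONE]" {
			var c chunk
			payload := strings.TrimPrefix(line, "data: ")
			if err := json.Unmarshal([]byte(payload), &c); err == nil &&
				len(c.Choices) == 0 && c.Usage != nil && string(c.Usage) != "null" {
				continue // swallow the usage-only chunk the client did not request
			}
		}
		fmt.Fprintln(client, line)
	}
	return scanner.Err()
}

func main() {
	tail := "data: {\"choices\":[],\"usage\":{\"prompt_tokens\":9,\"completion_tokens\":12,\"total_tokens\":21}}\ndata: [DONE]"
	// Only "data: [DONE]" reaches the client; the usage chunk is dropped.
	_ = filterUsageChunk(strings.NewReader(tail), os.Stdout, false)
}
```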
I have started a PR to make include_usage an optional parameter by default. If the user's TPM limit is configured, then include_usage is required.
The heterogeneous use case is not supported with streaming right now. Once that feature is added, include_usage should be enabled as well.