feat: enable streaming usage metrics for OpenAI-compatible providers
What does this PR do?
Injects `stream_options={"include_usage": True}` for OpenAI-compatible providers when streaming is used and telemetry is active. This allows token usage metrics to be collected and emitted for streaming responses.
Changes include:
- Injecting `stream_options` in `OpenAIMixin` (completion & chat) when tracing is enabled (see the sketch after this list)
- Adding metric emission logic for completion streaming in `InferenceRouter`
- Removing duplicate logic from WatsonX and Runpod providers
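For illustration only, a minimal sketch of the injection step under the assumptions above; the helper name `_inject_usage_option` and the `telemetry_active` flag are hypothetical and not the actual `OpenAIMixin` code:

```python
# Sketch of injecting stream_options for streaming requests when telemetry is on.
# Names and structure are illustrative; the real mixin code may differ.
def _inject_usage_option(params: dict, telemetry_active: bool) -> dict:
    """Ask OpenAI-compatible servers to append a final usage chunk to streams."""
    if params.get("stream") and telemetry_active:
        stream_options = dict(params.get("stream_options") or {})
        # Respect an explicit caller preference; only default the flag when unset.
        stream_options.setdefault("include_usage", True)
        params = {**params, "stream_options": stream_options}
    return params
```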
Closes https://github.com/llamastack/llama-stack/issues/3981
Test Plan
Added unit tests in `tests/unit/providers/utils/inference/test_openai_mixin.py` verifying the `stream_options` injection behavior.
Ran tests locally:
```bash
PYTHONPATH=src pytest tests/unit/providers/utils/inference/test_openai_mixin.py -v
```
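For reference, a minimal sketch of what such tests could look like, written against the hypothetical `_inject_usage_option` helper sketched above rather than the actual test file contents:

```python
# Hypothetical tests for the injection sketch; not the real OpenAIMixin tests.
def test_injects_include_usage_when_streaming_and_telemetry_active():
    params = {"model": "gpt-4o-mini", "stream": True}
    out = _inject_usage_option(params, telemetry_active=True)
    assert out["stream_options"] == {"include_usage": True}


def test_leaves_params_untouched_when_telemetry_inactive():
    params = {"model": "gpt-4o-mini", "stream": True}
    out = _inject_usage_option(params, telemetry_active=False)
    assert "stream_options" not in out
```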
This pull request has merge conflicts that must be resolved before it can be merged. @skamenan7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
I think #4127 should probably supersede this, right?
Yes, Charlie, even with such an overhaul of the telemetry (thanks @iamemilio), the OpenAI provider will not send usage data unless we explicitly ask for it with `stream_options={"include_usage": True}`, since automatic instrumentation libraries usually do not modify your API payloads to request extra data.
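For context, this is roughly how the flag behaves against the plain OpenAI Python client (the model and endpoint here are placeholders; any OpenAI-compatible server reachable via `base_url` behaves the same way):

```python
from openai import OpenAI

client = OpenAI()  # placeholder; point base_url at any OpenAI-compatible server

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # With include_usage set, the final chunk carries a populated usage object
    # (and an empty choices list); earlier chunks have usage=None.
    if chunk.usage is not None:
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)
```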
@skamenan7 Sorry to say, my changes are going to make a bit of work for you. I would really suggest working within the bounds of the changes I made and experimenting with automatic instrumentation from OpenTelemetry, because tokens are something it actively captures. That said, you are correct that tokens are not included in the payloads from streaming data unless you set it in the arguments. Please do experiment with my PR and see what has changed; the old telemetry system you are using is going to be removed soon. If Llama Stack wanted to enable token metrics for all the services it routes inference streaming requests to, that would be a clever solution, but we also need to make sure we are respecting the client's preferences and not returning the token metrics chunk if they did not enable it. I'm happy to help if you need it!
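One way to honor that client preference, sketched purely for illustration (not code from either PR; `record_token_metrics` is a stand-in for whatever metric emission the router performs): request usage internally for telemetry, but drop the usage-only chunk before yielding it to a client that did not opt in.

```python
from typing import Any, AsyncIterator


def record_token_metrics(usage: Any) -> None:
    """Stand-in for the router's metric emission; details are out of scope here."""
    ...


async def filter_usage_chunk(
    upstream: AsyncIterator[Any], client_requested_usage: bool
) -> AsyncIterator[Any]:
    """Record usage for telemetry, but only forward the usage chunk if the client asked for it."""
    async for chunk in upstream:
        usage = getattr(chunk, "usage", None)
        if usage is not None and not chunk.choices:
            record_token_metrics(usage)
            if not client_requested_usage:
                continue  # client did not opt in, so swallow the injected chunk
        yield chunk
```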
I think we are doing this backwards @mattf. @skamenan7 and I are going to pair on this to position this change as a follow up PR to #4127.
I cannot keep increasing the complexity of what is already an egregiously large pull request; otherwise it will be too difficult to review and test. Having to handle this would be a major time sink and setback for me.
Thanks @mattf for that catch. I updated Bedrock as well.
Yes, Emilio and I are meeting soon to discuss, but I made the updates so as not to forget.
cc: @leseb @rhuss
@skamenan7 please push again so the CI can run :) thanks
Sure, @leseb. Because of the major telemetry change, I am planning to close this PR and submit another one off of the rebased code, overlaying my changes on top of it. Thanks.
I have opened this PR instead - https://github.com/llamastack/llama-stack/pull/4326