
feat: enable streaming usage metrics for OpenAI-compatible providers

skamenan7 opened this pull request 2 months ago • 9 comments

What does this PR do?

Injects stream_options={"include_usage": True} for OpenAI-compatible providers when streaming is enabled and telemetry is active. This allows token usage metrics to be collected and emitted for streaming responses.

Changes include:

  • Injecting stream_options in OpenAIMixin (completion & chat) when tracing is enabled (see the sketch after this list)
  • Adding metric emission logic for completion streaming in InferenceRouter
  • Removing duplicate logic from WatsonX and Runpod providers
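A minimal sketch of the injection idea, assuming a plain params dict and a boolean telemetry flag; the helper name and call shape are illustrative, not the actual OpenAIMixin code:

```python
# Illustrative sketch only; OpenAIMixin's real method names and parameter
# handling differ. Shows the core idea: opt into usage reporting when the
# request streams and telemetry is active.
def _maybe_request_usage(params: dict, telemetry_active: bool) -> dict:
    if params.get("stream") and telemetry_active:
        stream_options = dict(params.get("stream_options") or {})
        stream_options.setdefault("include_usage", True)
        params["stream_options"] = stream_options
    return params
```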

Closes https://github.com/llamastack/llama-stack/issues/3981

Test Plan

Added unit tests in tests/unit/providers/utils/inference/test_openai_mixin.py verifying the stream_options injection behavior.

Ran tests locally: PYTHONPATH=src pytest tests/unit/providers/utils/inference/test_openai_mixin.py -v
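For illustration, a hedged example of the kind of check such a test might perform, written against the `_maybe_request_usage` sketch above rather than the real OpenAIMixin API:

```python
# Hypothetical tests against the illustrative helper above; the real suite in
# test_openai_mixin.py exercises OpenAIMixin directly.
def test_usage_requested_when_streaming_and_telemetry_active():
    params = _maybe_request_usage({"stream": True}, telemetry_active=True)
    assert params["stream_options"] == {"include_usage": True}

def test_params_untouched_when_telemetry_inactive():
    params = _maybe_request_usage({"stream": True}, telemetry_active=False)
    assert "stream_options" not in params
```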

skamenan7 avatar Nov 19 '25 22:11 skamenan7

This pull request has merge conflicts that must be resolved before it can be merged. @skamenan7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 19 '25 22:11 mergify[bot]

I think #4127 should probably supersede this, right?

cdoern avatar Nov 19 '25 22:11 cdoern

This pull request has merge conflicts that must be resolved before it can be merged. @skamenan7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 19 '25 22:11 mergify[bot]

Yes, Charlie, even with such an overhaul of the telemetry (thanks @iamemilio), the OpenAI provider will not send usage data unless we explicitly ask for it with stream_options={"include_usage": True}, since automatic instrumentation libraries usually do not modify your API payloads to request extra data.
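For example, with the stock openai Python client (model name is just a placeholder), the usage chunk only shows up when the caller opts in:

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},  # without this, no usage chunk arrives
)
for chunk in stream:
    if chunk.usage is not None:  # final chunk carries usage when opted in
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)
```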

skamenan7 avatar Nov 20 '25 13:11 skamenan7

@skamenan7 Sorry to say, my changes are going to make a bit of work for you. I would really suggest working within the bounds of the changes I made and experimenting with automatic instrumentation from OpenTelemetry, because tokens are something it actively captures. That said, you are correct that tokens are not included in the payloads of streaming responses unless you set it in the arguments. Please do experiment with my PR and see what has changed; the old telemetry system you are using is going to be removed soon. If llama stack wanted to enable token metrics for all the services it routes streaming inference requests to, that would be a clever solution, but we also need to make sure we are respecting the client's preferences and not returning the token metrics chunk if they did not enable it. I'm happy to help if you need it!
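A rough sketch of that idea, assuming the router wraps the provider's chunk stream; record_token_metrics and the chunk shape are hypothetical stand-ins, not llama-stack code:

```python
from typing import Any, AsyncIterator

def record_token_metrics(usage: Any) -> None:
    # Hypothetical telemetry hook standing in for the router's metric emission.
    print("tokens:", usage)

async def forward_stream(chunks: AsyncIterator[Any], client_requested_usage: bool):
    # Emit token metrics from the injected usage-only chunk, but only pass it
    # through to the client if the client actually asked for it.
    async for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None and not getattr(chunk, "choices", None):
            record_token_metrics(usage)
            if not client_requested_usage:
                continue  # swallow the chunk injected purely for telemetry
        yield chunk
```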

iamemilio avatar Nov 20 '25 14:11 iamemilio

I think we are doing this backwards @mattf. @skamenan7 and I are going to pair on this to position this change as a follow-up PR to #4127.

I cannot keep increasing the complexity of what is already an egregiously large pull request; otherwise it will be too difficult to review and test. Having to handle this would be a major time sink and setback for me.

iamemilio avatar Nov 20 '25 16:11 iamemilio

Thanks @mattf for that catch. I updated Bedrock as well.

skamenan7 avatar Nov 20 '25 18:11 skamenan7

Yes, Emilio and I are meeting soon to discuss, but I made the updates so as not to forget.

skamenan7 avatar Nov 20 '25 18:11 skamenan7

cc: @leseb @rhuss

skamenan7 avatar Nov 20 '25 20:11 skamenan7

This pull request has merge conflicts that must be resolved before it can be merged. @skamenan7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Dec 02 '25 05:12 mergify[bot]

@skamenan7 please push again so the CI can run :) thanks

leseb avatar Dec 02 '25 08:12 leseb

Sure, @leseb. Because of the major telemetry change, I am planning to close this PR and submit a new one based on the rebased code, with my changes overlaid on top. Thanks.

skamenan7 avatar Dec 03 '25 13:12 skamenan7

I have opened this PR instead - https://github.com/llamastack/llama-stack/pull/4326

skamenan7 avatar Dec 05 '25 15:12 skamenan7