[bug] Token counts on Phoenix inconsistent vs. OpenRouter or model providers
**Describe the bug**
I have an OpenAI-Agents-SDK Agent that uses OpenRouter & LiteLLM to send (i) streaming and (ii) non-streaming messages to LLM providers (e.g. OpenAI, Anthropic, etc.). The token counts reported for each span always match between the LLM providers and OpenRouter.
But for (i) streaming messages, the token counts shown in Phoenix are lower for every span than those reported by the LLM providers & OpenRouter.
Is this a bug on Phoenix or is there something wrong with the implementation on my end?
For example, here's an Agent trace with token counts for each span:
This is what the actual token counts were for each span, in the same order:
| tokens_prompt | tokens_completion | tokens_total |
|---|---|---|
| 5325 | 84 | 5409 |
| 9297 | 115 | 9412 |
| 9468 | 161 | 9629 |
The Agent code is like this:
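A minimal sketch of that kind of setup (the model name, API key, and instructions below are placeholders, not the original snippet):

```python
import asyncio

from agents import Agent, Runner
from agents.extensions.models.litellm_model import LitellmModel

# Placeholder model and key, routed via OpenRouter through LiteLLM.
agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant.",
    model=LitellmModel(model="openrouter/openai/gpt-4o", api_key="sk-or-..."),
)

async def main() -> None:
    # (i) streaming
    streamed = Runner.run_streamed(agent, "Hello!")
    async for _event in streamed.stream_events():
        pass  # consume stream events as they arrive
    # (ii) non-streaming
    result = await Runner.run(agent, "Hello!")
    print(result.final_output)

asyncio.run(main())
```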
I auto-instrument the tracing with the following (a registration sketch follows this list):
- openinference-instrumentation-litellm==0.1.25
- openinference-instrumentation-openai-agents==1.1.0, as I'm sending practically all LLM messages via LiteLLM (though not all via OpenRouter)
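A minimal registration sketch, assuming the usual arize-phoenix-otel pattern (the project name is a placeholder):

```python
from openinference.instrumentation.litellm import LiteLLMInstrumentor
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
from phoenix.otel import register

# Set up an OTel tracer provider that exports spans to Phoenix,
# then attach both instrumentors to it.
tracer_provider = register(project_name="my-agents")  # placeholder name
LiteLLMInstrumentor().instrument(tracer_provider=tracer_provider)
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
```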
I also saw this OpenRouter tracing page, but I think that approach would create triple-nested spans and triple-count the token usage.
**Additional context**
- arize-phoenix-client 1.15.3
- arize-phoenix-otel 0.12.1
- litellm 1.72.2
- openai-agents 0.2.11
- openinference-instrumentation 0.1.35
- opentelemetry-sdk 1.36.0
Hi @nkim500! I'm Dosu and I’m helping the openinference team.
This isn't a bug in Phoenix itself—Phoenix just reports the token counts it receives from upstream instrumentation via OpenTelemetry spans. For streaming responses, token counting can be tricky: if multiple instrumentations (like LiteLLM and OpenAI) are enabled, they may create nested spans or compete to set token count attributes, which can lead to undercounting or partial counts for streaming messages. The LiteLLM instrumentation is designed to aggregate streamed tokens and set the correct counts on the span, but this only works if the usage attribute is present and propagated correctly by the provider and instrumentation layer.
Best practice is to use only one instrumentation for LLM calls (e.g., LiteLLM or OpenAI, but not both) to avoid nested or conflicting spans and ensure accurate token counts. Issues with lower token counts in Phoenix for streaming are almost always due to how the upstream instrumentation collects and attributes token usage, not Phoenix's aggregation logic. For more details, see this relevant issue and the LiteLLM instrumentation code.
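For example, a minimal sketch of detaching one instrumentor so only a single layer owns the LLM spans (uninstrument() is the standard OpenTelemetry BaseInstrumentor hook that OpenInference instrumentors inherit):

```python
from openinference.instrumentation.litellm import LiteLLMInstrumentor

# If both instrumentors are attached, detach one so a single layer
# owns the LLM spans and their token-count attributes.
LiteLLMInstrumentor().uninstrument()
```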
Thanks, but to be clear, this undercounting for streaming messages still happens even after removing:
- openinference-instrumentation-litellm==0.1.25
This is a known limitation with streaming: the OpenAI (and compatible) APIs do not return token usage data in streaming responses unless you explicitly request it, e.g. with stream_options={"include_usage": True}. If your agent or LLM client doesn't pass this parameter, the instrumentation (including openinference-instrumentation-openai-agents) never receives token usage info and can't record accurate counts for Phoenix. This is not a Phoenix bug but a gap in what the upstream client provides during streaming, so even with only the openai-agents instrumentation, undercounting will persist unless the client is updated to request usage data in streaming mode. There is currently no workaround in OpenInference or Phoenix to estimate tokens when the provider doesn't supply them in the response. For more details, see this discussion and this explanation.
If you need accurate token counts for streaming, check if your LLM client or agent supports passing the include_usage flag (or equivalent) for streaming requests. If not, you may need to use non-streaming mode or monitor for upstream updates that add this support.
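A minimal sketch of requesting usage in streaming mode, assuming a direct LiteLLM call (the model name is a placeholder, and parameter support varies by provider):

```python
import litellm

# Ask the API to include token usage in the final streamed chunk.
response = litellm.completion(
    model="openrouter/openai/gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in response:
    usage = getattr(chunk, "usage", None)
    if usage is not None:  # usage typically arrives on the last chunk
        print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```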
Hi @nkim500! Thanks for filing -- do you mind pasting in a code snippet so we can try to fully reproduce your example? Thanks!
Hi @nate-mar @nkim500, this turned out to be a LiteLLM-side issue, not a Phoenix one. For streaming calls, LiteLLM wasn't propagating token usage to the spans consumed by OpenAI-Agents/OpenInference, so Phoenix showed lower token counts than the providers/OpenRouter did. Non-streaming was unaffected.
Yes, you’ll still see undercounting without the LiteLLM instrumentor because the OpenAI-Agents SDK doesn’t populate usage on streaming generations. Without openinference-instrumentation-litellm intercepting the stream and injecting usage from the final chunk, Phoenix receives no token totals for those spans, so counts look low.
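As an illustration (a hypothetical helper, not the instrumentor's actual code), what the LiteLLM instrumentor effectively does is read usage off the final streamed chunk and write the OpenInference token-count attributes that Phoenix reads onto the active span:

```python
from opentelemetry import trace

def record_usage_on_span(final_chunk) -> None:
    # Hypothetical helper: pull usage off the last streamed chunk and
    # set the OpenInference token-count attributes Phoenix aggregates.
    usage = getattr(final_chunk, "usage", None)
    if usage is None:
        return  # no usage in the stream -> Phoenix shows low/no counts
    span = trace.get_current_span()
    span.set_attribute("llm.token_count.prompt", usage.prompt_tokens)
    span.set_attribute("llm.token_count.completion", usage.completion_tokens)
    span.set_attribute("llm.token_count.total", usage.total_tokens)
```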