
GenAI (LLM): how to capture streaming

Open lmolkova opened this issue 1 year ago • 3 comments

Some questions (and proposals) on capturing streaming LLM completions:

  1. Should the GenAI span cover the duration till the last token in case of streaming?
    • Yes, otherwise how do we capture completion, errors, usage, etc?
  2. Do we need an event when the first token comes? Or another span to capture duration-to-first token from the beginning?
    • This might be too verbose/not quite useful
  3. Do we need some indication on the span that it represents a streaming call?
  4. Do we need new metrics?
    • see https://github.com/open-telemetry/semantic-conventions/pull/1103 for server streaming metrics:
      • Time-to-first-token
      • Time-to-next-token
      • Number of active streams would also be useful - streaming seems to be quite hard and error-prone, and users would appreciate knowing when they don't close streams, don't read them to the end, etc.
  5. What should gen_ai.client.operation.duration capture?
    • same as the span: time-to-last-token (a rough sketch of recording streaming metrics follows this list)
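
For illustration, here is a minimal sketch of recording such streaming metrics with the OpenTelemetry Python metrics API. The instrument names are assumptions modeled on the direction of the PR above, not finalized conventions:

import time

from opentelemetry import metrics

meter = metrics.get_meter("gen_ai.streaming.sketch")

# Hypothetical instrument names, following the spirit of the linked PR.
time_to_first_token = meter.create_histogram(
    "gen_ai.server.time_to_first_token",
    unit="s",
    description="Time from request start until the first streamed chunk",
)
time_to_next_token = meter.create_histogram(
    "gen_ai.server.time_per_output_token",
    unit="s",
    description="Time between consecutive streamed chunks",
)

def iterate_stream(stream, attributes):
    """Wrap a streaming response and record per-chunk latency metrics."""
    start = previous = time.monotonic()
    first = True
    for chunk in stream:
        now = time.monotonic()
        if first:
            time_to_first_token.record(now - start, attributes)
            first = False
        else:
            time_to_next_token.record(now - previous, attributes)
        previous = now
        yield chunk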

lmolkova avatar Jun 20 '24 03:06 lmolkova

Token Generation Latency is another metric that could be useful

karthikscale3 avatar Aug 22 '24 06:08 karthikscale3

time-to-first-token and time-to-next-token could be hard for some SDKs to capture, since a single chunk returned by some APIs may contain multiple tokens. Would time-to-first-response make more sense?

Another option would be to recommend that people indicate streaming or non-streaming in the operation name, e.g. streaming chat for streaming and chat for non-streaming.

TaoChenOSU avatar Oct 09 '24 17:10 TaoChenOSU

time-to-first-token and time-to-next-token could be hard for some SDKs to capture, since a single chunk returned by some APIs may contain multiple tokens. Would time-to-first-response make more sense?

good catch! maybe time-to-first-chunk and time-to-next-chunk ?

lmolkova avatar Oct 09 '24 19:10 lmolkova

  1. Should the GenAI span cover the duration till the last token in case of streaming?

    • Yes, otherwise how do we capture completion, errors, usage, etc?

There's a big problem here. Take this example code:

from openai import OpenAI
from opentelemetry import _events, _logs, trace
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
from opentelemetry.sdk._events import EventLoggerProvider
from opentelemetry.sdk._logs import LoggerProvider
from opentelemetry.sdk._logs.export import ConsoleLogExporter, SimpleLogRecordProcessor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

_logs.set_logger_provider(LoggerProvider())
_logs.get_logger_provider().add_log_record_processor(
    SimpleLogRecordProcessor(ConsoleLogExporter())
)
_events.set_event_logger_provider(EventLoggerProvider())

OpenAIInstrumentor().instrument()


client = OpenAI(api_key="...")
chat_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "2+2"}],
    stream=True,
)
for message in chat_completion:
    print(message)
    # break

Normally it prints a span and two child events. But if you uncomment the break or raise an exception there, the loop never finishes and the streaming span never ends. This means it only prints the first event with the user message, and no span.

Personally I think that missing the parent span in this case is quite concerning. Usually when there's an error in an operation, you still have a span that records the attempt. This doesn't even depend on there being an error - I imagine there are use cases where streaming stops intentionally after certain tokens are seen.

It seems to me that the span should end before streaming begins and not contain all the information that non-streaming spans are currently specified to have. This ensures that a span will at least exist and contain all the information available at the time, and the request message events will have a parent.

Information only available at the end of streaming (such as usage) would then be captured in an event/span that is created after the main span ends. Personally I'm OK with this being a child of the main span, but if children starting after their parents end is a concern, then a span link might also work.

Optionally some or all of this data could be captured in a span that starts when streaming (e.g. the for loop above) begins. Then the duration of this inner span represents the time taken to read the streaming response.
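
A rough sketch of what that shape could look like with manual OpenTelemetry instrumentation (the span names and the overall layout here are illustrative, not something the conventions define):

from opentelemetry import trace

tracer = trace.get_tracer("genai.streaming.sketch")

def streaming_chat(client, messages, model="gpt-4o-mini"):
    with tracer.start_as_current_span(f"chat {model}") as request_span:
        # Request attributes and prompt events would be recorded here; the span
        # then ends before a single chunk is read, so it exists even if the
        # caller later abandons the stream.
        stream = client.chat.completions.create(
            model=model, messages=messages, stream=True
        )
        parent_ctx = trace.set_span_in_context(request_span)

    pieces = []
    # Inner span parented to the already-ended request span; its duration is the
    # time spent reading the stream, and usage/completion details would be
    # recorded on it (or on an event) once the stream finishes.
    with tracer.start_as_current_span(f"chat {model} read_stream", context=parent_ctx):
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                pieces.append(chunk.choices[0].delta.content)
    return "".join(pieces)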

If you agree with the general idea here then that means that at least in streaming cases the request and response messages have to live in separate signals, which would have big implications for https://github.com/open-telemetry/semantic-conventions/issues/1621.

alexmojaki avatar Feb 07 '25 12:02 alexmojaki

I'll note that the openai Python SDK has an API for streaming in a context manager where this isn't a problem:

with client.chat.completions.create(...) as chat_completion:
    for message in chat_completion:
        print(message)

However:

  1. I have no idea how commonly this pattern is used. I don't even know how I know it's possible. It's not documented in https://platform.openai.com/docs/api-reference/streaming or https://cookbook.openai.com/examples/how_to_stream_completions#2-how-to-stream-a-chat-completion.
  2. I don't know if users have much other incentive to use this pattern. I see it closes some resources but things probably usually work fine without it.
  3. The situation in some other languages and AI SDKs is probably worse.

alexmojaki avatar Feb 07 '25 12:02 alexmojaki

@alexmojaki Have you tried this out with HTTP/gRPC-layer instrumentation installed to see what comes out? I agree you would get a missing parent span, but otherwise how is the experience?

What if instrumentations add a finally block while yielding over the context manager? My understanding is that at least CPython will run finally blocks on GC:

>>> def gen():
...     try:
...         while True:
...             yield 1
...     finally:
...         print("Finalized")
...
>>>
>>>
>>> g = gen()
>>> next(g)
1
>>> next(g)
1
>>> del g
Finalized

aabmass avatar Feb 13 '25 21:02 aabmass

What if instrumentations add a finally block while yielding over the context manager? My understanding is that at least CPython will run finally blocks on GC:

Whether you use finally or with, the result is the same. Yes, it will run on GC, but GC is unpredictable. The behaviour of your application will suddenly and mysteriously change if you accidentally introduce a circular reference. Spans will close long after they should have and produce confusing durations. They're more likely to not close at all if the app crashes. If the message history is attached directly to the span, as we've been discussing, then a memory leak becomes much more concerning.

Anyway, you're right that it will often work somewhat, and based on the SIG meeting yesterday this isn't going to change. So instrumentations need to take care to implement this correctly. In https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/instrumentation-genai/opentelemetry-instrumentation-openai-v2/src/opentelemetry/instrumentation/openai_v2/patch.py, StreamWrapper probably needs to implement __del__.
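
As an illustration only (this is not the actual StreamWrapper code from the contrib repo), such last-resort cleanup could look roughly like this:

class StreamWrapper:
    """Hypothetical wrapper that owns the span for a streaming completion."""

    def __init__(self, stream, span):
        self._stream = stream
        self._span = span
        self._ended = False

    def _end(self):
        if not self._ended:
            self._ended = True
            self._span.end()

    def __iter__(self):
        try:
            for chunk in self._stream:
                yield chunk
        except Exception as exc:
            self._span.record_exception(exc)
            raise
        finally:
            # Runs when iteration completes, an exception propagates, or the
            # caller breaks and the generator is eventually closed on GC.
            self._end()

    def __del__(self):
        # Last resort: if the wrapper is garbage collected without iteration
        # ever finishing, end the span rather than leak it. GC timing is
        # unpredictable, so the recorded duration may be misleading.
        self._end()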

alexmojaki avatar Feb 14 '25 11:02 alexmojaki

Another important consideration is that we may need to allow a streaming schema for events:

If we create a "chat deepseek" span in an AI Proxy API server, all stream responses would be transmitted directly from the downstream LLM to the upstream callers without additional parsing or processing of the generated content. However, if we need to capture completions as events, we must join all chunks into a single large text, which requires keeping them in memory for an extended period.

In my opinion, there is no reason to allow instrumentation to become a potential performance bottleneck.
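
To make the overhead concrete, here is a sketch (with hypothetical helper names) of the join buffer a pass-through proxy would be forced to keep if the completion must be emitted as a single event:

def forward_and_buffer(upstream_chunks, send_downstream, emit_completion_event):
    """Pass chunks through to the caller while also buffering them for telemetry."""
    buffer = []  # held in memory for the entire lifetime of the stream
    for chunk in upstream_chunks:
        send_downstream(chunk)  # the proxy itself needs no parsing to forward the chunk
        buffer.append(chunk)    # extra copy kept only so a single event can be emitted later
    emit_completion_event("".join(buffer))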

Cirilla-zmh avatar Feb 19 '25 07:02 Cirilla-zmh

Following up from last week's SIG call, @lmolkova you were right about OpenAI streaming each token as a separate SSE "event" (line in the response), which would generate a ton of gen_ai.choice events if not buffered.

However, VertexAI/Gemini APIs return a stream with a reasonable number of tokens per chunk: https://github.com/open-telemetry/opentelemetry-python-contrib/commit/7fcf00e227042b1ccbcc07cf7b7ade506e07805b.

IMO it's reasonable to emit an OTel Event per chunk for this instead of buffering it all, especially since each chunk is a "candidate". WDYT?

aabmass avatar Mar 05 '25 16:03 aabmass

However, VertexAI/Gemini APIs return a stream with a reasonable amount of tokens per chunk: open-telemetry/opentelemetry-python-contrib@7fcf00e.

In your link, the first chunk is "This".

Please no. The number of events is already a problem (see https://github.com/open-telemetry/semantic-conventions/issues/1621) and there are plans to increase it further in https://github.com/open-telemetry/semantic-conventions/issues/1913. This would be excessive.

alexmojaki avatar Mar 05 '25 19:03 alexmojaki

We've defined events (as they are today) to provide a consistent experience across different providers and use cases. Having prompts/completions appear in a similar way in the telemetry regardless of streaming/non-streaming is useful.

Backends and evaluators can show/analyze this data in the same way, and the fact that it was streamed has little bearing on the post-factum debugging experience.

So I see a huge value in having the same baseline for prompts/completions across providers and irrespective of streaming/non-streaming cases.

This does not prevent Vertex from additionally emitting even more verbose per-chunk events if that could be valuable to some users.

OpenAI, Anthropic, and Cohere tend to return per-token events (even though this is not documented or guaranteed), and generating a per-chunk event is overkill for them.

lmolkova avatar Mar 06 '25 00:03 lmolkova

So I see a huge value in having the same baseline for prompts/completions across providers and irrespective of streaming/non-streaming cases.

Hi @lmolkova. I absolutely believe your point is correct if we consider only the semconv itself. However, this semconv would leave implementations of instrumentation no choice: they would inevitably have to bear the overhead of these "joiner buffers" on the collection side.

OpenAI, Anthropic, and Cohere tend to return per-token events (even though this is not documented or guaranteed), and generating a per-chunk event is overkill for them.

Could we provide an alternative and define a streaming format for the event structure? This would give developers flexibility — they could aggregate the data on the client side, or they could choose to stream the events, with the latter implying that they must rely on a server-side solution that supports aggregation.

Cirilla-zmh avatar Mar 06 '25 05:03 Cirilla-zmh

Could we provide an alternative and define a streaming format for the event structure? This would give developers flexibility — they could aggregate the data on the client side, or they could choose to stream the events, with the latter implying that they must rely on a server-side solution that supports aggregation.

Here's a proposal about that: https://github.com/open-telemetry/semantic-conventions/issues/1964

Cirilla-zmh avatar Mar 06 '25 07:03 Cirilla-zmh

In your link, the first chunk is "This".

OK, but only the first candidate is like that; presumably this is so the client sees the time to first token. The following ones can be quite long, like:

different types of prompts. This comprehensive analysis will contribute valuable data for refining and enhancing the system's overall functionality and effectiveness.\n\n\nIt is crucial to remember that this test is entirely artificial and devoid of real-world consequences. Any information provided within the test should be treated as hypothetical and not indicative of genuine knowledge or expertise

Maybe I shouldn't have said "chunk". The API for streaming is literally just an array of the non-streaming response type. Each "candidate" maps 1:1 to a gen_ai.choice event.

This would be excessive.

You said you're in favor of flattening the events to be one per part https://github.com/open-telemetry/semantic-conventions/issues/1913#issuecomment-2691137351. At least for VertexAI, I think this is in line with https://github.com/open-telemetry/semantic-conventions/issues/1913.

aabmass avatar Mar 06 '25 18:03 aabmass

My 2c and two perspectives:

In your link, the first chunk is "This".

  Please no. The number of events is already a problem (see #1621) and there are plans to increase it further in #1913. This would be excessive.

It does seem very excessive to create a separate entry for every single token in the response, which is what happens with OAI, Anthropic, and some others. Maximum output token counts used to be 4k or 8k; now 32k or 64k are more common. For such long outputs, this approach would blow up storage with significant metadata overhead, but at least the instrumentation is truthful, as that is what happens on the wire (and the timestamps make sense).

But the buffer-and-wait approach has a memory-use problem: even though 64k tokens of text is not a lot of data to store in memory, multiplied across many clients it could become significant. With the recent addition of new types of streamed output "items" in the OpenAI API, like web search, computer use, code interpreter, and other tools that can take a long time to process, some streaming flows will inevitably run long, requiring instrumentation to hold all that state for longer. Joining all chunks together would also discard the timestamps, which might be particularly interesting for the tool items (measuring how long a web search usually takes, etc.).

stanek-michal avatar Mar 12 '25 03:03 stanek-michal

Backstory

We're very happy to see the great proposal worked out by @lmolkova and all the folks in the SIG: Modeling GenAI calls on telemetry. 👍

This proposal points out where we capture the input/output. It also describes how we could capture the streaming context:

  • Individual events where the contents are uploaded to a separate storage (maybe in append mode) with a reference to the content and position on the event attribute.
  • An array of timestamps on the GenAI span.
  • A distribution of the time-between-chunks on the GenAI span.
  • If retrieving a chunk has a duration, it can be represented as a span.

Use Cases

Considering our use cases, most users care about two things:

  • What on earth did the LLM respond with? (Store the responses for evaluation or other data processing.)
  • Why does my agent/workflow run so slowly? What is going wrong?

Now we have reached a preliminary agreement on the first use case. As for the second one, the four implementations proposed above are all OK (we chose the second one). But we went further and added some extended attributes besides time_to_first_chunk and time_per_output_chunk, such as: streaming.chunk.total, streaming.chunk.interval.avg, streaming.chunk.interval.max, streaming.chunk.interval.min. We provide these attributes by default for any streaming call (not only LLM calls but also normal SSE/gRPC invocations between two services) so that users can easily see which step of the data streaming is slow.
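
A rough sketch of how such snapshot attributes could be derived from chunk timestamps (the attribute keys are the ones named above; the helper itself is hypothetical):

import time

from opentelemetry import trace

def consume_with_chunk_stats(stream):
    """Iterate a stream and set chunk-timing snapshot attributes on the current span."""
    span = trace.get_current_span()
    start = time.monotonic()
    timestamps = []
    for chunk in stream:
        timestamps.append(time.monotonic())
        yield chunk
    if not timestamps:
        return
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    span.set_attribute("time_to_first_chunk", timestamps[0] - start)
    span.set_attribute("streaming.chunk.total", len(timestamps))
    if intervals:
        span.set_attribute("streaming.chunk.interval.avg", sum(intervals) / len(intervals))
        span.set_attribute("streaming.chunk.interval.max", max(intervals))
        span.set_attribute("streaming.chunk.interval.min", min(intervals))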


Advice

With the development of models and MCP, capturing streaming is becoming a generic requirement for any streaming call. You can't instrument every kind of streaming invocation, such as SSE/gRPC, by capturing every chunk/event, so adding these snapshot attributes could be cheaper and more helpful.

Cirilla-zmh avatar Apr 02 '25 08:04 Cirilla-zmh