Proposal: chunk streaming and LLM provider latency metrics
Area(s)
area:gen-ai
Propose new conventions
The current `gen_ai.server.time_to_first_token` metric is useful for tracking server-side latency and LLM "spin-up", but it is much less informative for client-side optimizations.
My thought was that when instrumenting an application that uses an agentic framework, it would be helpful for the framework to emit telemetry that can answer the following questions:
- How long was my request in transit to and from the LLM provider before I began seeing a response?
  - I propose `gen_ai.client.operation.time_to_first_chunk` as a client-side version of the `gen_ai.server.time_to_first_token`, or time to first token (TTFT), metric.
  - This allows ops to measure (and ultimately optimize) overall lag/latency from the LLM providers' APIs (provisioning, message queues, etc.).
- How many tokens per second were generated during generation itself (excluding server-side resourcing, queuing, and provisioning)?
  - I propose `gen_ai.client.operation.time_per_output_chunk` as a client-side version of the `gen_ai.server.time_per_output_token` metric.
  - This allows ops to measure (and ultimately compare) LLM providers by their speed and cost.
- How long did my request take to complete in total?
  - The existing `gen_ai.client.operation.duration` metric already covers this.
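To make the two proposed metrics concrete, here is a minimal sketch of how a client could derive them while consuming a streamed response. The names `measure_stream` and `StreamingLatency` are hypothetical illustration, not part of any SDK; a real instrumentation would record the values into OpenTelemetry histograms under the proposed metric names rather than return them.

```python
import time
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StreamingLatency:
    """Client-side latency figures for one streamed LLM response (hypothetical)."""
    time_to_first_chunk: float    # seconds from request send to first chunk
    time_per_output_chunk: float  # mean seconds between subsequent chunks


def measure_stream(chunks: Iterable[str], clock=time.monotonic):
    """Consume a chunk stream and compute the two proposed client-side metrics.

    `chunks` stands in for whatever iterator the provider SDK returns
    (e.g. an SSE stream); `clock` is injectable so the logic is testable.
    """
    start = clock()
    first = last = None
    n = 0
    collected = []
    for chunk in chunks:
        now = clock()
        if first is None:
            first = now  # first chunk observed -> time_to_first_chunk anchor
        last = now
        n += 1
        collected.append(chunk)
    if first is None:
        raise ValueError("stream produced no chunks")
    ttfc = first - start
    # Average inter-chunk gap during generation only (first chunk to last chunk).
    tpoc = (last - first) / (n - 1) if n > 1 else 0.0
    return collected, StreamingLatency(ttfc, tpoc)
```

Note that `time_per_output_chunk` is computed only over the generation window (first chunk to last chunk), so it deliberately excludes the provisioning/queuing delay that `time_to_first_chunk` captures.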
For additional context, many builders are not running inference locally and likely don't have access to the server's token- and chunk-emission telemetry to measure these values directly. Given that gap in client-side telemetry, these metrics would be valuable from an LLM ops optimization standpoint.