Observability: Add Additional Metrics to Llama Stack Telemetry
🚀 Describe the new functionality needed
📊 Current Metrics
Llama Stack currently reports basic token metrics for inference:
- `llama_stack_prompt_tokens_total` - Input tokens
- `llama_stack_completion_tokens_total` - Output tokens
- `llama_stack_tokens_total` - Total tokens
🎯 Proposed Additional Metrics by API
1. API Gateway Metrics (All APIs)
```
# Request-level metrics for all APIs
llama_stack_requests_total{api="inference",status="success"} 1234
llama_stack_requests_total{api="inference",status="error"} 56
llama_stack_request_duration_seconds{api="inference",quantile="0.95"} 0.456
llama_stack_concurrent_requests{api="inference"} 5

# Provider routing metrics
llama_stack_provider_requests_total{api="inference",provider="openai"} 789
llama_stack_provider_requests_total{api="inference",provider="vllm"} 445
llama_stack_provider_errors_total{api="inference",provider="openai",error_type="rate_limit"} 12

# Rate limiting and quota
llama_stack_rate_limit_hits_total{api="inference",user_id="user123"} 3
llama_stack_quota_exceeded_total{api="inference",user_id="user123"} 1
```
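For illustration only, here is a minimal sketch of how these request-level metrics could be captured with the OpenTelemetry metrics API; the meter name, helper function, and label keys are assumptions, not an agreed implementation. A real implementation would likely live in the router layer so every API gets this for free.

```python
# Sketch only: records the request-level metrics above via the OpenTelemetry
# metrics API. Metric and label names mirror the proposal; nothing is final.
import time
from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.telemetry")  # assumed meter name

requests_total = meter.create_counter(
    "llama_stack_requests_total", description="Requests per API and status"
)
request_duration = meter.create_histogram(
    "llama_stack_request_duration_seconds", unit="s",
    description="End-to-end request latency per API",
)
concurrent_requests = meter.create_up_down_counter(
    "llama_stack_concurrent_requests", description="In-flight requests per API"
)

def record_request(api: str, handler):
    """Wrap a router handler call and emit the request-level metrics."""
    attrs = {"api": api}
    concurrent_requests.add(1, attrs)
    start = time.perf_counter()
    try:
        result = handler()
        requests_total.add(1, {**attrs, "status": "success"})
        return result
    except Exception:
        requests_total.add(1, {**attrs, "status": "error"})
        raise
    finally:
        request_duration.record(time.perf_counter() - start, attrs)
        concurrent_requests.add(-1, attrs)
```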
2. Inference API Metrics
```
# Enhanced token metrics (current + new)
llama_stack_prompt_tokens_total{model_id="llama-3.2-3b",provider="openai"} 1234
llama_stack_completion_tokens_total{model_id="llama-3.2-3b",provider="openai"} 567
llama_stack_tokens_total{model_id="llama-3.2-3b",provider="openai"} 1801

# Performance metrics
llama_stack_time_to_first_token_seconds{model_id="llama-3.2-3b",provider="openai",quantile="0.95"} 0.123
llama_stack_tokens_per_second{model_id="llama-3.2-3b",provider="openai",quantile="0.5"} 45.6
llama_stack_inference_duration_seconds{model_id="llama-3.2-3b",provider="openai",quantile="0.95"} 2.34

# Model-specific metrics
llama_stack_model_requests_total{model_id="llama-3.2-3b",model_type="llm"} 1234
llama_stack_model_requests_total{model_id="all-MiniLM-L6-v2",model_type="embedding"} 567
llama_stack_model_errors_total{model_id="llama-3.2-3b",error_type="model_unavailable"} 2

# Streaming metrics
llama_stack_streaming_requests_total{model_id="llama-3.2-3b"} 234
llama_stack_streaming_chunks_total{model_id="llama-3.2-3b"} 1234
```
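To make the time-to-first-token and tokens-per-second definitions concrete, here is a hedged sketch around a streaming inference call; `stream_chat_completion` and the `num_tokens` chunk field are hypothetical placeholders, not existing Llama Stack APIs.

```python
# Sketch only: derive time-to-first-token and tokens/sec from a streaming
# response. The provider call and the chunk fields are hypothetical.
import time
from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.telemetry")
ttft_hist = meter.create_histogram("llama_stack_time_to_first_token_seconds", unit="s")
tps_hist = meter.create_histogram("llama_stack_tokens_per_second", unit="{token}/s")
chunks_total = meter.create_counter("llama_stack_streaming_chunks_total")

def stream_with_metrics(model_id: str, provider: str, request):
    attrs = {"model_id": model_id, "provider": provider}
    start = time.perf_counter()
    first_token_at = None
    completion_tokens = 0
    for chunk in stream_chat_completion(request):  # hypothetical provider call
        if first_token_at is None:
            first_token_at = time.perf_counter()
            ttft_hist.record(first_token_at - start, attrs)
        completion_tokens += chunk.num_tokens      # hypothetical chunk field
        chunks_total.add(1, attrs)
        yield chunk
    elapsed = time.perf_counter() - start
    if elapsed > 0:
        tps_hist.record(completion_tokens / elapsed, attrs)
```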
3. Vector I/O API Metrics
```
# Vector operations
llama_stack_vector_inserts_total{vector_db="chromadb",operation="chunks"} 1234
llama_stack_vector_queries_total{vector_db="chromadb",operation="search"} 567
llama_stack_vector_deletes_total{vector_db="chromadb",operation="chunks"} 89

# Vector performance
llama_stack_vector_query_duration_seconds{vector_db="chromadb",quantile="0.95"} 0.234
llama_stack_vector_insert_duration_seconds{vector_db="chromadb",quantile="0.95"} 1.23
llama_stack_vector_chunks_processed_total{vector_db="chromadb"} 5678

# Vector store metrics
llama_stack_vector_stores_total{provider="chromadb"} 12
llama_stack_vector_files_total{vector_db="chromadb"} 45
llama_stack_vector_dimensions{vector_db="chromadb"} 384
```
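One possible pattern for the vector timing metrics is a small shared context manager used by the insert, query, and delete paths; the metric names follow the proposal above, everything else is an assumption.

```python
# Sketch only: time vector-store operations and record them into histograms.
# Metric names and the `vector_db` label mirror the proposal above.
import time
from contextlib import contextmanager
from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.telemetry")
query_duration = meter.create_histogram(
    "llama_stack_vector_query_duration_seconds", unit="s")
insert_duration = meter.create_histogram(
    "llama_stack_vector_insert_duration_seconds", unit="s")

@contextmanager
def timed(histogram, vector_db: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        histogram.record(time.perf_counter() - start, {"vector_db": vector_db})

# Usage inside a (hypothetical) Vector I/O router:
# with timed(query_duration, vector_db="chromadb"):
#     results = provider.query_chunks(request)
```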
4. Safety API Metrics
```
# Safety checks
llama_stack_safety_checks_total{shield_id="llama-guard",status="passed"} 1234
llama_stack_safety_checks_total{shield_id="llama-guard",status="blocked"} 56
llama_stack_safety_violations_total{shield_id="llama-guard",category="violence"} 12
llama_stack_safety_violations_total{shield_id="llama-guard",category="hate"} 3

# Safety performance
llama_stack_safety_check_duration_seconds{shield_id="llama-guard",quantile="0.95"} 0.045
llama_stack_safety_false_positives_total{shield_id="llama-guard"} 2
```
5. Agents API Metrics
```
# Agent workflows
llama_stack_agent_workflows_total{agent_id="meta-reference",status="completed"} 123
llama_stack_agent_workflows_total{agent_id="meta-reference",status="failed"} 12
llama_stack_agent_steps_total{agent_id="meta-reference"} 456

# Agent performance
llama_stack_agent_workflow_duration_seconds{agent_id="meta-reference",quantile="0.95"} 15.6
llama_stack_agent_tool_calls_total{agent_id="meta-reference",tool="web_search"} 89
llama_stack_agent_tool_calls_total{agent_id="meta-reference",tool="rag"} 67
```
6. Evaluation API Metrics
```
# Evaluation jobs
llama_stack_eval_jobs_total{benchmark_id="basic",status="completed"} 45
llama_stack_eval_jobs_total{benchmark_id="basic",status="failed"} 3
llama_stack_eval_duration_seconds{benchmark_id="basic",quantile="0.95"} 120.5

# Scoring functions
llama_stack_scoring_calls_total{scoring_fn="llm-as-judge",status="success"} 234
llama_stack_scoring_calls_total{scoring_fn="llm-as-judge",status="error"} 12
llama_stack_scoring_duration_seconds{scoring_fn="llm-as-judge",quantile="0.95"} 2.34
```
7. Tool Runtime API Metrics
```
# Tool invocations
llama_stack_tool_invocations_total{tool_group="websearch",tool="brave-search",status="success"} 123
llama_stack_tool_invocations_total{tool_group="websearch",tool="tavily-search",status="error"} 5
llama_stack_tool_invocations_total{tool_group="rag",tool="rag-runtime",status="success"} 89

# Tool performance
llama_stack_tool_duration_seconds{tool_group="websearch",tool="brave-search",quantile="0.95"} 1.23
llama_stack_tool_duration_seconds{tool_group="rag",tool="rag-runtime",quantile="0.95"} 0.456

# RAG-specific metrics
llama_stack_rag_queries_total{vector_db="chromadb"} 67
llama_stack_rag_documents_retrieved_total{vector_db="chromadb"} 234
llama_stack_rag_inserts_total{vector_db="chromadb"} 45
```
8. Dataset I/O API Metrics
```
# Dataset operations
llama_stack_dataset_registrations_total{provider="huggingface",status="success"} 23
llama_stack_dataset_registrations_total{provider="localfs",status="success"} 12
llama_stack_dataset_rows_processed_total{dataset_id="test-dataset"} 1234

# Dataset performance
llama_stack_dataset_iteration_duration_seconds{dataset_id="test-dataset",quantile="0.95"} 5.67
llama_stack_dataset_chunk_size{dataset_id="test-dataset"} 100
```
🔍 Industry Standards Comparison
The proposed inference metrics are similar to those offered by:
- Anthropic: https://docs.anthropic.com/en/docs/claude-code/monitoring-usage#interpreting-metrics-and-events-data
- Gemini: https://cloud.google.com/gemini/docs/monitor-gemini
- Sambanova: https://docs.sambanova.ai/sambastudio/latest/monitoring.html
- vLLM: https://docs.vllm.ai/en/latest/usage/metrics.html?h=metrics
I couldn't find other published metric references for inference or for the other APIs. We could dig deeper into existing cloud providers, vector databases, and similar services.
I'm going to create sub-issues for each API.
📋 Acceptance Criteria
- [ ] API-level request metrics for all routers
- [ ] Provider-specific performance and error tracking
- [ ] Enhanced inference metrics (time to first token, tokens/sec)
- [ ] Vector I/O operation metrics
- [ ] Safety check metrics
- [ ] Agent workflow metrics
- [ ] Evaluation and scoring metrics
- [ ] Tool runtime metrics
- [ ] Dataset operation metrics
- [ ] Updated documentation with all new metrics
- [ ] Unit tests for new metric generation (see the test sketch after this list)
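For the unit-test item above, here is a hedged sketch of one way to assert that a metric is actually emitted, using the OpenTelemetry SDK's in-memory metric reader; the metric name and labels are just the proposed examples.

```python
# Sketch only: unit-test pattern for verifying that a metric is emitted,
# using the OpenTelemetry SDK's in-memory reader.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

def test_request_counter_emitted():
    reader = InMemoryMetricReader()
    provider = MeterProvider(metric_readers=[reader])
    meter = provider.get_meter("llama_stack.telemetry")
    counter = meter.create_counter("llama_stack_requests_total")

    counter.add(1, {"api": "inference", "status": "success"})

    data = reader.get_metrics_data()
    names = {
        metric.name
        for rm in data.resource_metrics
        for sm in rm.scope_metrics
        for metric in sm.metrics
    }
    assert "llama_stack_requests_total" in names
```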
💡 Why is this needed? What if we don't build it?
Without these metrics, operators have very limited observability: no visibility into request rates, error rates, latency, or per-provider and per-model behavior across Llama Stack APIs.
Other thoughts
No response
I am interested.
Thanks @leseb, that's a great collection of metrics, and thanks for doing the research on what's already out there.
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant!
Looks like this issue dropped off. @leseb, do we want to record the metrics somewhere else so they don't get lost and we have a good starting point if we come back to this topic?
This is blocked by https://github.com/llamastack/llama-stack/issues/3806. I want to steer Llama Stack towards using automatic instrumentation as the default way to capture telemetry data; we should get that working first and then assess which metrics from this list still need to be captured manually.
@leseb can we update this issue to represent the metrics we want to add now that various APIs have been removed and the telemetry implementation has been refactored? Thanks!
This is an important note: OpenAI clients currently get telemetry automatically because there is an OTEL library that detects and instruments them. This is not the case for Llama Stack clients, and our users will miss out on a lot of convenience because of this. It may deserve a separate issue IMO, but I wanted to flag it here first. See: https://opentelemetry.io/docs/specs/semconv/gen-ai/openai/. Many tools are built to capture data from OpenAI clients: OpenTelemetry, Traceloop, MLflow, Langfuse, etc.
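For reference, this is roughly what that convenience looks like for OpenAI clients today, assuming the `opentelemetry-instrumentation-openai-v2` contrib package is installed and an exporter is configured; there is no equivalent instrumentor for the llama-stack client.

```python
# Sketch only: OpenAI clients can be auto-instrumented by an existing OTEL
# contrib package (assumed: opentelemetry-instrumentation-openai-v2), which
# emits spans/metrics per the gen-ai semantic conventions. No equivalent
# instrumentor exists for llama-stack clients today.
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor
from openai import OpenAI

OpenAIInstrumentor().instrument()  # patches the OpenAI client globally

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",  # example model, not a recommendation
    messages=[{"role": "user", "content": "hello"}],
)
```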