
[ChatQnA] Provide E2E performance metrics

Open eero-t opened this issue 1 year ago • 4 comments

Currently one can get inferencing metrics from the TGI and TEI backend services, but there are no E2E metrics for the whole pipeline, e.g. first-response latency and the response continuation (token) rate.

I think at minimum the following (Prometheus) counter metrics would be needed from the ChatQnA frontend service:

  • request count
  • first tokens count for responses + sum of their duration
  • further tokens count for responses + sum of their duration

That way one can get end-to-end latencies for user request processing, averaged over any interval (duration sum divided by token count over that interval): both the initial response delay and the rate at which the response is completed.

This can be used to monitor the whole service and to see how the actual response time to user requests improves with backend scaling. Contrasting the E2E metrics with the backend services' inferencing metrics shows whether something else needs attention, e.g. scaling other components or improving how the frontend uses the (scaled) backends.

PS. The same also applies to the other provided example services, but for now I care only about ChatQnA.

eero-t avatar Jul 10 '24 16:07 eero-t

Providing such metrics is straightforward.

When a user query comes in:

  • Timestamp first token start time
  • Increase query_count counter

When replying with the first token for that query:

  • Add diff from first token start time to first_tokens_duration counter
  • Increase first_tokens_count counter
  • Timestamp next token start time

When replying with further tokens for that query:

  • Add diff from next token start time to next_tokens_duration counter
  • Increase next_tokens_count counter
  • Timestamp next token start time

When receiving a GET request for the "/metrics" URL path, respond with the current values of all counters:

# HELP query_count Total count of end-user queries
# TYPE query_count counter
query_count <total>
# HELP first_tokens_count Total count of all first tokens
# TYPE first_tokens_count counter
first_tokens_count <total>
# HELP first_tokens_duration Sum of first token durations
# TYPE first_tokens_duration counter
first_tokens_duration <total>
# HELP next_tokens_count Total count of all next tokens
# TYPE next_tokens_count counter
next_tokens_count <total>
# HELP next_tokens_duration Sum of next token durations
# TYPE next_tokens_duration counter
next_tokens_duration <total>
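
A minimal sketch of the bookkeeping above, assuming the prometheus_client Python package; backend_stream is a hypothetical stand-in for the pipeline's token stream, and the counter names mirror those listed above:

import time
from prometheus_client import Counter

# Counters as proposed above (a chatqna_ prefix could be added, see the later note).
# Note: prometheus_client exposes Counter metrics with a _total suffix.
query_count = Counter("query_count", "Total count of end-user queries")
first_tokens_count = Counter("first_tokens_count", "Total count of all first tokens")
first_tokens_duration = Counter("first_tokens_duration", "Sum of first token durations")
next_tokens_count = Counter("next_tokens_count", "Total count of all next tokens")
next_tokens_duration = Counter("next_tokens_duration", "Sum of next token durations")

async def stream_answer(query, backend_stream):
    """Wrap the (hypothetical) backend token stream with the bookkeeping above."""
    start = time.monotonic()            # timestamp first token start time
    query_count.inc()                   # increase query_count
    first = True
    async for token in backend_stream(query):
        now = time.monotonic()
        if first:
            first_tokens_duration.inc(now - start)   # add diff to first_tokens_duration
            first_tokens_count.inc()
            first = False
        else:
            next_tokens_duration.inc(now - start)    # add diff to next_tokens_duration
            next_tokens_count.inc()
        start = now                     # timestamp next token start time
        yield token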

How to get Prometheus to scrape the metrics: https://github.com/opea-project/GenAIComps/issues/260
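
For serving the "/metrics" path itself, one option (an assumption, not necessarily how the ChatQnA frontend is built) is to mount the prometheus_client ASGI app on the existing FastAPI application:

from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# Expose all registered counters in Prometheus text format under /metrics,
# so the scrape config linked above can pick them up.
app.mount("/metrics", make_asgi_app())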

Note: query_count and first_tokens_count are separate counters because:

  • a query may get an error instead of response tokens
  • in loaded situations there can be a long time between a query being queued to the backend and the backend providing the first token for it

eero-t avatar Jul 10 '24 16:07 eero-t

Note: metrics should have a relevant prefix, e.g. chatqna_ for the ChatQnA service, so they can be identified more easily.

eero-t avatar Jul 19 '24 19:07 eero-t

we will discuss how to implement it

kevinintel avatar Jul 25 '24 01:07 kevinintel

Each metric should also have a label that identifies which Helm release a given ChatQnA instance belongs to. I.e. there should be a service option for that, which Helm can set in the service YAML.
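
A sketch of how that could look on the service side, assuming the option is passed in as an environment variable (HELM_RELEASE here is an illustrative name, not an existing option):

import os
from prometheus_client import Counter

# Value that the Helm chart would set in the service YAML (illustrative name).
release = os.environ.get("HELM_RELEASE", "unknown")

chatqna_query_count = Counter(
    "chatqna_query_count",              # chatqna_ prefix as noted above
    "Total count of end-user queries",
    labelnames=["release"],
)
chatqna_query_count.labels(release=release).inc()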

Service's Python / HTTP metrics provided by current observability add-on: https://github.com/opea-project/GenAIInfra/tree/main/kubernetes-addons/Observability

These can tell something about past performance:

$ curl --no-progress-meter http://$(kubectl get -o jsonpath="{.spec.clusterIP}:{.spec.ports[0].port}" svc/chatqna)/metrics | grep HELP | grep -v _created
# HELP python_gc_objects_collected_total Objects collected during gc
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# HELP python_gc_collections_total Number of times this generation was collected
# HELP python_info Python platform information
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# HELP process_resident_memory_bytes Resident memory size in bytes.
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# HELP process_open_fds Number of open file descriptors.
# HELP process_max_fds Maximum number of open file descriptors.
# HELP http_requests_total Total number of requests by method, status and handler.
# HELP http_request_size_bytes Content length of incoming requests by handler. Only value of header is respected. Otherwise ignored. No percentile calculated. 
# HELP http_response_size_bytes Content length of outgoing responses by handler. Only value of header is respected. Otherwise ignored. No percentile calculated. 
# HELP http_request_duration_highr_seconds Latency with many buckets but no API specific labels. Made for more accurate percentile calculations. 
# HELP http_request_duration_seconds Latency with only few buckets by handler. Made to be only used if aggregation by handler is important.

But they are NOT suitable for scaling. The rate of requests is just the speed at which the service can currently process requests; a larger number of users for the service does not significantly change that.

For that, a metric on how many requests are pending in the queue would be needed. And token latency info would help to track the overall E2E performance of the whole service better, as it's the main indicator that services like these are benchmarked on.
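
Such a queue-depth metric could be as simple as a gauge tracking in-flight requests; a hedged sketch with prometheus_client (the metric name and the run_pipeline call are illustrative):

from prometheus_client import Gauge

# How many user requests are currently queued / being processed by the frontend.
chatqna_requests_pending = Gauge(
    "chatqna_requests_pending", "End-user requests currently in flight"
)

async def handle_query(request):
    chatqna_requests_pending.inc()
    try:
        return await run_pipeline(request)   # illustrative pipeline call
    finally:
        chatqna_requests_pending.dec()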

eero-t avatar Oct 04 '24 13:10 eero-t

Was more complicated than I thought, but fixed with https://github.com/opea-project/GenAIComps/pull/845

eero-t avatar Nov 04 '24 10:11 eero-t