
LLM Deployment Observability

Open · roelschr opened this issue on Nov 03, 2023 · 3 comments

I assume that, since RayLLM runs on top of Ray Serve, I can follow these steps to get observability for LLM deployments (on KubeRay).

But how can we get custom metrics that are specific to LLMs, like the ones suggested by the Ray team itself: https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference#benchmarking-results-for-per-token-llm-products

roelschr avatar Nov 03 '23 10:11 roelschr

Yes, you should be able to set up observability using the general Ray Serve guides.
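
For example (a minimal sketch, assuming the standard `ray_serve_*` metric names exported by Ray and Grafana's `$__rate_interval` variable), a basic request-rate panel could use something like:

```promql
# Requests per second arriving at the Serve HTTP proxy.
# Assumes the standard ray_serve_* metric prefix exported by Ray.
rate(ray_serve_num_http_requests[$__rate_interval])
```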

For the custom metrics, you can search for the "ray_aviary" prefix (e.g., in Grafana, if that's what you're using); we have most of the metrics used in the blog post available.
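
For instance (a sketch, assuming Prometheus is scraping the cluster), you can list everything under that prefix in the Grafana Explore view with:

```promql
# Select all time series whose metric name starts with "ray_aviary".
{__name__=~"ray_aviary.*"}
```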

akshay-anyscale avatar Nov 06 '23 21:11 akshay-anyscale

Thanks @akshay-anyscale, I found all of them very useful!

But I'm seeing a discrepancy between the times reported by the ray-llm Prometheus metrics and by llmperf. While evaluating a llama2-7b deployment, llmperf reports an inter-token latency (ITL) of 36.62 ms/token, but in Grafana the average, computed with `rate(ray_aviary_router_get_response_stream_per_token_latency_ms_sum[$__rate_interval]) / rate(ray_aviary_router_get_response_stream_per_token_latency_ms_count[$__rate_interval])`, is around 101 ms.

~~I suspected per_token_latency includes the time to first token. Do you have any idea why only this metric seems different?~~ I see the token_latency clock being reset after the first token here. But I wonder whether the processing time spent while yielding the first token is what's causing this inconsistency. In any case, I've checked the timing with other tools besides llmperf and they all agree at around 35 ms.

roelschr avatar Nov 10 '23 12:11 roelschr

Hi @roelschr, I believe this is because we do some batching (up to 100 ms) to make the streaming more efficient, which lines up with the ~101 ms average you're seeing. If you make the denominator `ray_aviary_tokens_generated` instead, the result should be closer to the llmperf value; the denominator will be off by one per request, though, because of the first token.
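
A sketch of the adjusted query (assuming `ray_aviary_tokens_generated` is a counter scraped alongside the latency metric):

```promql
# Per-token latency using generated tokens as the denominator.
# Off by one token per request, since the first token is excluded
# from the latency sum (the clock resets after the first token).
rate(ray_aviary_router_get_response_stream_per_token_latency_ms_sum[$__rate_interval])
/
rate(ray_aviary_tokens_generated[$__rate_interval])
```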

akshay-anyscale avatar Nov 13 '23 23:11 akshay-anyscale