[Serve] add debugging metrics to ray serve
### Autoscaling & Capacity

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Target Replicas | ray_serve_autoscaling_target_replicas | The target number of replicas the autoscaler wants to reach | Critical for understanding autoscaling lag. "Why aren't we at target?" is unanswerable today. |
| Autoscaling Decision | ray_serve_autoscaling_desired_replicas | The raw decision from the autoscaling policy, before min/max bounds are applied | Debug why the autoscaler chose a certain number; identify policy misconfiguration |
| Total Requests (Autoscaler View) | ray_serve_autoscaling_total_requests | Total requests as seen by the autoscaler | Verify the autoscaler's input matches the expected load |
| Replica Autoscaling Metrics Delay | ray_serve_autoscaling_replica_metrics_delay_ms | Time taken for replica metrics to be reported to the controller | Detect a busy or overloaded controller |
| Handle Autoscaling Metrics Delay | ray_serve_autoscaling_handle_metrics_delay_ms | Time taken for handle metrics to be reported to the controller | Detect a busy or overloaded controller |
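To make the shape of these proposed metrics concrete, here is a minimal sketch (not the actual controller code) of emitting the target-replicas gauge with ray.util.metrics; the report_target_replicas hook and the tag set are assumptions for illustration:

```python
from ray.util.metrics import Gauge

# Sketch only: Ray prefixes application-defined metrics with "ray_", so this
# would surface in Prometheus as ray_serve_autoscaling_target_replicas.
target_replicas_gauge = Gauge(
    "serve_autoscaling_target_replicas",
    description="Target number of replicas the autoscaler wants to reach.",
    tag_keys=("deployment", "application"),
)


def report_target_replicas(deployment: str, application: str, target: int) -> None:
    # Hypothetical hook, called wherever the controller finalizes a scaling decision.
    target_replicas_gauge.set(
        target, tags={"deployment": deployment, "application": application}
    )
```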
### Request Batching

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Batch Wait Time | ray_serve_batch_wait_time_ms | Time requests waited for the batch to fill | Debug latency caused by waiting for batches |
| Batch Queue Length | ray_serve_batch_queue_length | Number of requests waiting in the batch queue | Distinguish a batching bottleneck from a processing bottleneck |
| Batch Utilization | ray_serve_batch_utilization_percent | actual_batch_size / max_batch_size * 100 | Tune the max_batch_size parameter; low utilization means the batch timeout is too aggressive |
| Batches Processed | ray_serve_batches_processed_total | Counter of batches executed | Measure batching throughput separately from request throughput |
| Batch Execution Time | ray_serve_batch_execution_time_ms | Time to execute a single batch | |
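As an illustration of where the batching metrics could come from, here is a rough sketch of recording batch utilization and execution time inside a @serve.batch method; the metric objects, tag keys, and the toy model below are placeholders, not the proposed implementation:

```python
import time
from typing import List

from ray import serve
from ray.util.metrics import Gauge, Histogram

MAX_BATCH_SIZE = 8

batch_utilization = Gauge(
    "serve_batch_utilization_percent",
    description="actual_batch_size / max_batch_size * 100 (illustrative).",
    tag_keys=("deployment",),
)
batch_execution_time = Histogram(
    "serve_batch_execution_time_ms",
    description="Time to execute one batch, in milliseconds (illustrative).",
    boundaries=[1, 5, 10, 50, 100, 500, 1000],
    tag_keys=("deployment",),
)


@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=MAX_BATCH_SIZE, batch_wait_timeout_s=0.05)
    async def handle_batch(self, inputs: List[str]) -> List[str]:
        start = time.perf_counter()
        results = [text.upper() for text in inputs]  # stand-in for model inference
        tags = {"deployment": "BatchedModel"}
        batch_utilization.set(len(inputs) / MAX_BATCH_SIZE * 100, tags=tags)
        batch_execution_time.observe((time.perf_counter() - start) * 1000, tags=tags)
        return results

    async def __call__(self, request) -> str:
        return await self.handle_batch((await request.body()).decode())
```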
### Latency Breakdown

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Queue Wait Time | ray_serve_queue_wait_time_ms | Time a request spent waiting in the queue before being assigned to a replica | Critical: separate queueing delay from processing delay |
| Replica Queue Length | ray_serve_router_queue_len_guage | The request router's view of each replica's queue length | Will help debug routing imbalances |
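The queue wait metric only requires timestamping requests at enqueue and assignment time; below is a toy sketch of that measurement pattern (the InstrumentedQueue class is hypothetical, not the actual Serve router):

```python
import time
from collections import deque

from ray.util.metrics import Histogram

queue_wait_time = Histogram(
    "serve_queue_wait_time_ms",
    description="Time a request waited in the router queue before assignment (illustrative).",
    boundaries=[1, 5, 10, 50, 100, 500, 1000, 5000],
    tag_keys=("deployment",),
)


class InstrumentedQueue:
    """Toy queue that timestamps requests on enqueue and reports the wait on
    dequeue. Not the real Serve router, just the measurement pattern."""

    def __init__(self, deployment: str):
        self._deployment = deployment
        self._queue = deque()

    def enqueue(self, request) -> None:
        self._queue.append((request, time.perf_counter()))

    def dequeue(self):
        request, enqueued_at = self._queue.popleft()
        queue_wait_time.observe(
            (time.perf_counter() - enqueued_at) * 1000,
            tags={"deployment": self._deployment},
        )
        return request
```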
### Replica Health & Lifecycle

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Replica Startup Latency | ray_serve_replica_startup_latency_ms | Time from replica creation to ready state | Debug slow cold starts; model loading time |
| Replica Initialization Latency | serve_replica_initialization_latency_ms | Time taken to run the replica actor's constructor | |
| Replica Reconfigure Latency | ray_serve_replica_reconfigure_latency_ms | Time for a replica to complete reconfigure | Debug slow reconfiguration; model loading time |
| Health Check Latency | ray_serve_health_check_latency_ms | Duration of health check calls | Identify slow health checks blocking scaling |
| Health Check Failures | ray_serve_health_check_failures_total | Count of failed health checks | Early warning before a replica is marked unhealthy |
| Replica Shutdown Duration | ray_serve_replica_shutdown_duration_ms | Time from shutdown signal to the replica being fully stopped | Debug slow draining during scale-down or rolling updates |
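A sketch of how the health-check metrics could be recorded around the existing health-check call; the timed_health_check wrapper, the check_health RPC name, and the tag keys are assumptions for illustration:

```python
import time

from ray.util.metrics import Counter, Histogram

health_check_latency = Histogram(
    "serve_health_check_latency_ms",
    description="Duration of replica health check calls (illustrative).",
    boundaries=[1, 5, 10, 50, 100, 500, 1000],
    tag_keys=("deployment", "replica"),
)
health_check_failures = Counter(
    "serve_health_check_failures_total",
    description="Count of failed replica health checks (illustrative).",
    tag_keys=("deployment", "replica"),
)


async def timed_health_check(replica_handle, deployment: str, replica: str) -> bool:
    """Hypothetical wrapper around whatever call performs the health check today."""
    tags = {"deployment": deployment, "replica": replica}
    start = time.perf_counter()
    try:
        await replica_handle.check_health.remote()  # assumed health-check RPC
        return True
    except Exception:
        health_check_failures.inc(1, tags=tags)
        return False
    finally:
        health_check_latency.observe((time.perf_counter() - start) * 1000, tags=tags)
```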
### Proxy Health

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Proxy Healthy | ray_serve_proxy_healthy | Total number of healthy proxies in the system. Tags: node_id, node_ip_address | Proxy availability |
| Proxy Draining State | ray_serve_proxy_draining | Whether the proxy is draining (1=draining, 0=not). Tags: node_id, node_ip_address | Visibility during rolling updates |
| Routing Stats Delay | ray_serve_routing_stats_delay_ms | Time taken for routing stats to travel from a replica to the controller | Controller performance |
| Proxy Shutdown Duration | ray_serve_proxy_shutdown_duration_ms | Time from shutdown signal to the proxy being fully stopped | Debug slow draining during rolling updates |
### State Timeline

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Deployment Status | ray_serve_deployment_status | Numeric status of the deployment (0=DEPLOY_FAILED, 1=UNHEALTHY, 2=UPDATING, 3=UPSCALING, 4=DOWNSCALING, 5=HEALTHY). Tags: deployment, application | State Timeline visualization; deployment lifecycle debugging |
| Application Status | ray_serve_application_status | Numeric status of the application (0=NOT_STARTED, 1=DEPLOYING, 2=DEPLOY_FAILED, 3=RUNNING, 4=UNHEALTHY, 5=DELETING). Tags: application | State Timeline visualization; application lifecycle debugging |
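To illustrate how the numeric encoding would be produced, here is a minimal sketch of emitting the deployment status as a gauge; the mapping mirrors the table above, while the function name and call site are assumptions:

```python
from ray.util.metrics import Gauge

# Numeric encoding taken from the table above.
DEPLOYMENT_STATUS_CODES = {
    "DEPLOY_FAILED": 0,
    "UNHEALTHY": 1,
    "UPDATING": 2,
    "UPSCALING": 3,
    "DOWNSCALING": 4,
    "HEALTHY": 5,
}

deployment_status_gauge = Gauge(
    "serve_deployment_status",
    description="Numeric status of a deployment (illustrative).",
    tag_keys=("deployment", "application"),
)


def report_deployment_status(deployment: str, application: str, status: str) -> None:
    # Hypothetical hook, called on each status transition the controller observes.
    deployment_status_gauge.set(
        DEPLOYMENT_STATUS_CODES[status],
        tags={"deployment": deployment, "application": application},
    )
```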
### Long Poll

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Long Poll Latency | ray_serve_long_poll_latency_ms | Time for updates to propagate from the controller to clients | Debug slow config propagation; impacts autoscaling response time |
| Long Poll Pending Clients | ray_serve_long_poll_pending_clients | Number of clients waiting for updates, per namespace | Identify backpressure in the notification system |
Looks good, two questions:

- What is the difference between ray_serve_replica_startup_latency_ms and serve_replica_initialization_latency_ms?
- I believe adding shutdown duration metrics for the proxy and controller would be helpful, as we are doing it for the replica (ray_serve_replica_shutdown_duration_ms). Thoughts on it?
> What is the difference between ray_serve_replica_startup_latency_ms and serve_replica_initialization_latency_ms?

ray_serve_replica_startup_latency_ms is the time taken for a node to be provisioned (if one is not already running, e.g. in a VM or on k8s) + time taken for the runtime env to be bootstrapped on the node for the actor (pip, docker image pull, etc.) + time taken for the Ray actor to be scheduled + time taken to run the actor constructor.

serve_replica_initialization_latency_ms is only the time taken to run the actor constructor.
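Put differently (a sketch with made-up timestamps, not actual Serve fields), the startup latency is a superset of the initialization latency:

```python
# Hypothetical timestamps along one replica's startup path (illustration only).
replica_requested_at = 0.0             # controller decides to start the replica
actor_constructor_started_at = 42.0    # node provisioned, runtime env ready, actor scheduled
actor_constructor_finished_at = 65.0   # constructor done, replica ready to serve

# serve_replica_initialization_latency_ms covers only the constructor.
initialization_latency_ms = (actor_constructor_finished_at - actor_constructor_started_at) * 1000

# ray_serve_replica_startup_latency_ms covers provisioning + runtime env + scheduling
# + constructor, so it is always >= the initialization latency.
startup_latency_ms = (actor_constructor_finished_at - replica_requested_at) * 1000
```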
> I believe adding shutdown duration metrics for the proxy and controller would be helpful, as we are doing it for the replica (ray_serve_replica_shutdown_duration_ms). Thoughts on it?

I think the proxy shutdown duration metric makes sense, will add it.
I think it'd be useful to have more observability into why requests are routed to certain replicas. One metric that'd be useful is the request router's view of each replica's cached queue length.
@akyang-anyscale good idea, added ray_serve_replica_queue_len_guage. I think handle, deployment, replica, and application as dimensions make sense to me.
For metric names: how about renaming ray_serve_autoscaling_decision_replicas to ray_serve_autoscaling_desired_replicas to make it clearer? And ray_serve_deployment_target_replicas to ray_serve_autoscaling_policy_replicas or ray_serve_autoscaling_decision_replicas to keep it under the ray_serve_autoscaling naming convention?
Would the delay metrics also have the deployment dimension?
For ray_serve_batch_utilization_percent, can we also add ray_serve_actual_batch_size?
What does ray_serve_replica_queue_len_guage do that's different from today's running-requests-per-replica metric? Suggest renaming queue_wait_time_ms to something more specific, like request_routing_delay_ms.
> What does ray_serve_replica_queue_len_guage do that's different from today's running-requests-per-replica metric?

ray_serve_replica_queue_len_guage is the deployment request router's view of the replica, whereas ray_serve_num_ongoing_requests_at_replicas is the replica's own view; if they drift a lot, that is indicative of an issue.
> For ray_serve_batch_utilization_percent, can we also add ray_serve_actual_batch_size?
Ack.
> Would the delay metrics also have the deployment dimension?
Yes
> How about renaming ray_serve_autoscaling_decision_replicas to ray_serve_autoscaling_desired_replicas to make it clearer? And ray_serve_deployment_target_replicas to ray_serve_autoscaling_policy_replicas or ray_serve_autoscaling_decision_replicas to keep it under the ray_serve_autoscaling naming convention?
ray_serve_deployment_target_replicas is agnostic of autoscaling; it will be emitted even when the user controls the replica count directly through num_replicas.

I will rename ray_serve_autoscaling_decision_replicas to ray_serve_autoscaling_desired_replicas. But note that ray_serve_autoscaling_desired_replicas != ray_serve_deployment_target_replicas.
Several of the latency/time metrics, like ray_serve_routing_stats_delay_ms, may be better packaged as histograms instead of what I assume is a _sum counter - it'll ensure accurate support for histogram_quantile() and allow a much clearer understanding of the latency distribution.
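For reference, a sketch of what packaging such a delay as a histogram could look like with ray.util.metrics; the bucket boundaries and tag keys below are placeholders, not a proposed configuration:

```python
from ray.util.metrics import Histogram

# Exporting the delay as a histogram (rather than a plain sum/counter) produces
# _bucket/_sum/_count series, so a query like
#   histogram_quantile(0.99, rate(ray_serve_routing_stats_delay_ms_bucket[5m]))
# reflects the actual latency distribution. The bucket boundaries below are
# placeholders and would need tuning.
routing_stats_delay = Histogram(
    "serve_routing_stats_delay_ms",
    description="Time for routing stats to travel from replica to controller (illustrative).",
    boundaries=[1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000],
    tag_keys=("deployment",),
)


def record_routing_stats_delay(deployment: str, delay_ms: float) -> None:
    routing_stats_delay.observe(delay_ms, tags={"deployment": deployment})
```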