
# [Serve] add debugging metrics to ray serve

Open · abrarsheikh opened this issue 3 weeks ago · 6 comments

## Autoscaling & Capacity

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Target Replicas | `ray_serve_autoscaling_target_replicas` | The target number of replicas the autoscaler wants to reach | Critical for understanding autoscaling lag. "Why aren't we at target?" is unanswerable today. |
| Autoscaling Decision | `ray_serve_autoscaling_desired_replicas` | The raw decision from the autoscaling policy, before bounds are applied | Debug why the autoscaler chose a certain number; identify policy misconfiguration |
| Total Requests (Autoscaler View) | `ray_serve_autoscaling_total_requests` | Total requests as seen by the autoscaler | Verify that the autoscaler's input matches the expected load |
| Replica Autoscaling Metrics Delay | `ray_serve_autoscaling_replica_metrics_delay_ms` | Time taken for replica metrics to be reported to the controller | Detect a busy controller |
| Handle Autoscaling Metrics Delay | `ray_serve_autoscaling_handle_metrics_delay_ms` | Time taken for handle metrics to be reported to the controller | Detect a busy controller |
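
To make the proposed names concrete, here is a minimal sketch of how one of these gauges could be registered and updated with `ray.util.metrics`. The tag set, the helper function, and the call site are assumptions for illustration, not Serve's actual implementation.

```python
# Minimal sketch (assumption, not Serve's implementation): registering and
# updating the proposed target-replicas gauge with ray.util.metrics.
# Ray's Prometheus exporter typically prepends "ray_", so this would surface
# as ray_serve_autoscaling_target_replicas.
import ray
from ray.util import metrics

ray.init()

target_replicas_gauge = metrics.Gauge(
    "serve_autoscaling_target_replicas",
    description="The target number of replicas the autoscaler wants to reach.",
    tag_keys=("deployment", "application"),
)

def record_autoscaling_target(deployment: str, application: str, target: int) -> None:
    # Hypothetical helper the controller's autoscaling loop could call each tick.
    target_replicas_gauge.set(
        target, tags={"deployment": deployment, "application": application}
    )

record_autoscaling_target("my_deployment", "my_app", target=4)
```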

## Request Batching

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Batch Wait Time | `ray_serve_batch_wait_time_ms` | Time requests waited for the batch to fill | Debug latency caused by waiting for batches |
| Batch Queue Length | `ray_serve_batch_queue_length` | Number of requests waiting in the batch queue | Identify a batching bottleneck vs. a processing bottleneck |
| Batch Utilization | `ray_serve_batch_utilization_percent` | `actual_batch_size / max_batch_size * 100` | Tune the `max_batch_size` parameter; low utilization means the batch timeout is too aggressive |
| Batches Processed | `ray_serve_batches_processed_total` | Counter of batches executed | Measure batching throughput separately from request throughput |
| Batch Execution Time | `ray_serve_batch_execution_time_ms` | Time to execute a batch | |
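
The sketch below simply works through the utilization formula from the table. The `BatchStats` class and its field names are hypothetical; only the `actual_batch_size / max_batch_size * 100` calculation comes from the proposal.

```python
# Worked example of the proposed batch-utilization computation. BatchStats is
# a hypothetical container; only the formula is taken from the table above.
from dataclasses import dataclass

@dataclass
class BatchStats:
    actual_batch_size: int  # requests actually executed together
    max_batch_size: int     # configured cap, e.g. @serve.batch(max_batch_size=...)

    @property
    def utilization_percent(self) -> float:
        return self.actual_batch_size / self.max_batch_size * 100

# A batch that fired with 3 of 16 slots filled -> 18.75% utilization,
# which would suggest the batch wait timeout is too aggressive for the load.
print(BatchStats(actual_batch_size=3, max_batch_size=16).utilization_percent)
```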

## Latency Breakdown

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Queue Wait Time | `ray_serve_queue_wait_time_ms` | Time a request spent waiting in the queue before assignment | Critical: separates queueing delay from processing delay |
| Replica Queue Length | `ray_serve_router_queue_len_guage` | The request router's view of each replica's cached queue length | Will help debug routing imbalances |

## Replica Health & Lifecycle

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Replica Startup Latency | `ray_serve_replica_startup_latency_ms` | Time from replica creation to ready state | Debug slow cold starts; model loading time |
| Replica Initialization Latency | `serve_replica_initialization_latency_ms` | Time taken to run the actor constructor | |
| Replica Reconfigure Latency | `ray_serve_replica_reconfigure_latency_ms` | Time for a replica to complete reconfigure | Debug slow reconfiguration; model loading time |
| Health Check Latency | `ray_serve_health_check_latency_ms` | Duration of health check calls | Identify slow health checks blocking scaling |
| Health Check Failures | `ray_serve_health_check_failures_total` | Count of failed health checks | Early warning before a replica is marked unhealthy |
| Replica Shutdown Duration | `ray_serve_replica_shutdown_duration_ms` | Time from shutdown signal to replica fully stopped | Debug slow draining during scale-down or rolling updates |

## Proxy Health

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Proxy Healthy | `ray_serve_proxy_healthy` | Total number of healthy proxies in the system. Tags: `node_id`, `node_ip_address` | Proxy availability |
| Proxy Draining State | `ray_serve_proxy_draining` | Whether the proxy is draining (1 = draining, 0 = not). Tags: `node_id`, `node_ip_address` | Visibility during rolling updates |
| Routing Stats Delay | `ray_serve_routing_stats_delay_ms` | Time taken for routing stats to get from replica to controller | Controller performance |
| Proxy Shutdown Duration | `ray_serve_proxy_shutdown_duration_ms` | Time from shutdown signal to proxy fully stopped | |

## State Timeline

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Deployment Status | `ray_serve_deployment_status` | Numeric status of the deployment (0 = DEPLOY_FAILED, 1 = UNHEALTHY, 2 = UPDATING, 3 = UPSCALING, 4 = DOWNSCALING, 5 = HEALTHY). Tags: `deployment`, `application` | State Timeline visualization; deployment lifecycle debugging |
| Application Status | `ray_serve_application_status` | Numeric status of the application (0 = NOT_STARTED, 1 = DEPLOYING, 2 = DEPLOY_FAILED, 3 = RUNNING, 4 = UNHEALTHY, 5 = DELETING). Tags: `application` | State Timeline visualization; application lifecycle debugging |
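
A minimal sketch of how the deployment-status encoding above could be emitted as a gauge; the mapping copies the table, while the helper function, tags, and demo call are hypothetical.

```python
# Sketch (assumption, not Serve's implementation): encode the textual
# DeploymentStatus values into the numeric codes listed in the table and
# publish them through a gauge so a State Timeline panel can plot them.
import ray
from ray.util import metrics

ray.init()

# Mapping copied from the proposed ray_serve_deployment_status description.
DEPLOYMENT_STATUS_CODES = {
    "DEPLOY_FAILED": 0,
    "UNHEALTHY": 1,
    "UPDATING": 2,
    "UPSCALING": 3,
    "DOWNSCALING": 4,
    "HEALTHY": 5,
}

deployment_status_gauge = metrics.Gauge(
    "serve_deployment_status",
    description="Numeric status of a Serve deployment.",
    tag_keys=("deployment", "application"),
)

def emit_deployment_status(deployment: str, application: str, status: str) -> None:
    # Hypothetical hook the controller could call whenever a status changes.
    deployment_status_gauge.set(
        DEPLOYMENT_STATUS_CODES[status],
        tags={"deployment": deployment, "application": application},
    )

emit_deployment_status("my_deployment", "my_app", "HEALTHY")
```

With a numeric encoding like this, a dashboard could, for example, alert when the gauge drops below 2 (DEPLOY_FAILED or UNHEALTHY); the usual caveat is that enum-as-gauge values are only meaningful alongside the documented mapping.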

## Long Poll

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Long Poll Latency | `ray_serve_long_poll_latency_ms` | Time for updates to propagate from the controller to clients | Debug slow config propagation; impacts autoscaling response time |
| Long Poll Pending Clients | `ray_serve_long_poll_pending_clients` | Number of clients waiting for updates, per namespace | Identify backpressure in the notification system |

**abrarsheikh** commented Dec 06 '25 04:12

Looks good, two questions:

1. What is the difference between `ray_serve_replica_startup_latency_ms` and `serve_replica_initialization_latency_ms`?
2. I believe adding shutdown duration metrics for the proxy and controller would be helpful, as we are doing it for replicas (`ray_serve_replica_shutdown_duration_ms`). Thoughts?

**harshit-anyscale** commented Dec 09 '25 19:12

> What is the difference between `ray_serve_replica_startup_latency_ms` and `serve_replica_initialization_latency_ms`?

`ray_serve_replica_startup_latency_ms` is the time taken for a node to be provisioned (if one is not already running, in a VM or on Kubernetes) + the time taken to bootstrap the runtime env on that node for the actor (pip install, Docker image pull, etc.) + the time taken for the Ray actor to be scheduled + the time taken to run the actor constructor.

`serve_replica_initialization_latency_ms` = the time taken to run the actor constructor.

> I believe adding shutdown duration metrics for the proxy and controller would be helpful, as we are doing it for replicas (`ray_serve_replica_shutdown_duration_ms`). Thoughts?

I think a proxy shutdown duration metric makes sense; I will add it.

**abrarsheikh** commented Dec 09 '25 19:12

I think it'd be useful to have more observability into why requests are routed to certain replicas. One metric that'd be useful is the request router's view of each replica's cached queue length.

**akyang-anyscale** commented Dec 09 '25 21:12

@akyang-anyscale good idea, added `ray_serve_replica_queue_len_guage`. I think handle, deployment, replica, and application as dimensions make sense to me.

**abrarsheikh** commented Dec 09 '25 22:12

For metric names:

How about renaming `ray_serve_autoscaling_decision_replicas` to `ray_serve_autoscaling_desired_replicas` to make it clearer? And `ray_serve_deployment_target_replicas` to `ray_serve_autoscaling_policy_replicas` or `ray_serve_autoscaling_decision_replicas` to keep it under the `ray_serve_autoscaling` naming convention?

Would the delay metrics also have the deployment dimension?

For `ray_serve_batch_utilization_percent`, can we also add `ray_serve_actual_batch_size`?

What does `ray_serve_replica_queue_len_guage` do that's different from today's running-requests-per-replica metrics? Suggest renaming `queue_wait_time_ms` to something more specific like `request_routing_delay_ms`.

**akshay-anyscale** commented Dec 10 '25 15:12

> What does `ray_serve_replica_queue_len_guage` do that's different from today's running-requests-per-replica metrics?

`ray_serve_replica_queue_len_guage` is the deployment request router's view of the replica, whereas `ray_serve_num_ongoing_requests_at_replicas` is the replica's own view; if the two drift apart significantly, that is indicative of an issue.

> For `ray_serve_batch_utilization_percent`, can we also add `ray_serve_actual_batch_size`?

Ack.

> Would the delay metrics also have the deployment dimension?

Yes.

> How about renaming `ray_serve_autoscaling_decision_replicas` to `ray_serve_autoscaling_desired_replicas` to make it clearer? And `ray_serve_deployment_target_replicas` to `ray_serve_autoscaling_policy_replicas` or `ray_serve_autoscaling_decision_replicas` to keep it under the `ray_serve_autoscaling` naming convention?

`ray_serve_deployment_target_replicas` is agnostic of autoscaling; it will be emitted even when the user controls the replica count through `num_replicas`.

I will rename `ray_serve_autoscaling_decision_replicas` to `ray_serve_autoscaling_desired_replicas`. But note that `ray_serve_autoscaling_desired_replicas` != `ray_serve_deployment_target_replicas`.

**abrarsheikh** commented Dec 10 '25 18:12

Several of the latency/time metrics, such as `ray_serve_routing_stats_delay_ms`, may be worth packaging as histograms instead of what I assume would be `_sum` counters: that would ensure accurate support for `histogram_quantile()` and give a much clearer view of the latency distribution.
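
For illustration, here is a minimal sketch of what that could look like with `ray.util.metrics.Histogram`; the bucket boundaries, tag set, and example PromQL query are assumptions, not decisions made in this issue.

```python
# Sketch (assumption): exporting the proposed routing-stats delay as a
# histogram so Prometheus keeps per-bucket counts instead of only a sum.
import ray
from ray.util import metrics

ray.init()

routing_stats_delay_hist = metrics.Histogram(
    "serve_routing_stats_delay_ms",
    description="Time for routing stats to travel from replica to controller.",
    boundaries=[1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500],  # ms buckets (assumed)
    tag_keys=("deployment", "application"),
)

# Record one observed delay of 42 ms for a hypothetical deployment.
routing_stats_delay_hist.observe(
    42.0, tags={"deployment": "my_deployment", "application": "my_app"}
)

# With bucketed data exported, a dashboard could then query, e.g.:
#   histogram_quantile(
#     0.99,
#     sum(rate(ray_serve_routing_stats_delay_ms_bucket[5m])) by (le, deployment)
#   )
```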

**csivanich** commented Dec 15 '25 20:12