[Serve] add debugging metrics to ray serve
### Autoscaling & Capacity

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Target Replicas | ray_serve_autoscaling_target_replicas | The target number of replicas the autoscaler wants to reach | Critical for understanding autoscaling lag. "Why aren't we at target?" is unanswerable today. |
| Autoscaling Decision | ray_serve_autoscaling_desired_replicas | The raw decision from the autoscaling policy, before min/max bounds are applied | Debug why the autoscaler chose a certain number; identify policy misconfiguration |
| Total Requests (Autoscaler View) | ray_serve_autoscaling_total_requests | Total requests as seen by the autoscaler | Verify the autoscaler's input matches the expected load |
| Replica Autoscaling Metrics Delay | ray_serve_autoscaling_replica_metrics_delay_ms | Time taken for replica metrics to be reported to the controller | Detect a busy or overloaded controller |
| Handle Autoscaling Metrics Delay | ray_serve_autoscaling_handle_metrics_delay_ms | Time taken for handle metrics to be reported to the controller | Detect a busy or overloaded controller |
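To make the shape of these proposed metrics concrete, here is a minimal sketch (not the actual controller code) of emitting the target-replicas gauge with ray.util.metrics; the report_target_replicas hook and the tag set are assumptions for illustration:

```python
from ray.util.metrics import Gauge

# Sketch only: Ray prefixes application-defined metrics with "ray_", so this
# would surface in Prometheus as ray_serve_autoscaling_target_replicas.
target_replicas_gauge = Gauge(
    "serve_autoscaling_target_replicas",
    description="Target number of replicas the autoscaler wants to reach.",
    tag_keys=("deployment", "application"),
)


def report_target_replicas(deployment: str, application: str, target: int) -> None:
    # Hypothetical hook, called wherever the controller finalizes a scaling decision.
    target_replicas_gauge.set(
        target, tags={"deployment": deployment, "application": application}
    )
```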
### Request Batching

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Batch Wait Time | ray_serve_batch_wait_time_ms | Time requests waited for the batch to fill | Debug latency caused by waiting for batches |
| Batch Queue Length | ray_serve_batch_queue_length | Number of requests waiting in the batch queue | Distinguish a batching bottleneck from a processing bottleneck |
| Batch Utilization | ray_serve_batch_utilization_percent | actual_batch_size / max_batch_size * 100 | Tune the max_batch_size parameter; low utilization means the batch timeout is too aggressive |
| Batches Processed | ray_serve_batches_processed_total | Counter of batches executed | Measure batching throughput separately from request throughput |
| Batch Execution Time | ray_serve_batch_execution_time_ms | Time to execute a single batch | |
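As an illustration of where the batching metrics could come from, here is a rough sketch of recording batch utilization and execution time inside a @serve.batch method; the metric objects, tag keys, and the toy model below are placeholders, not the proposed implementation:

```python
import time
from typing import List

from ray import serve
from ray.util.metrics import Gauge, Histogram

MAX_BATCH_SIZE = 8

batch_utilization = Gauge(
    "serve_batch_utilization_percent",
    description="actual_batch_size / max_batch_size * 100 (illustrative).",
    tag_keys=("deployment",),
)
batch_execution_time = Histogram(
    "serve_batch_execution_time_ms",
    description="Time to execute one batch, in milliseconds (illustrative).",
    boundaries=[1, 5, 10, 50, 100, 500, 1000],
    tag_keys=("deployment",),
)


@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=MAX_BATCH_SIZE, batch_wait_timeout_s=0.05)
    async def handle_batch(self, inputs: List[str]) -> List[str]:
        start = time.perf_counter()
        results = [text.upper() for text in inputs]  # stand-in for model inference
        tags = {"deployment": "BatchedModel"}
        batch_utilization.set(len(inputs) / MAX_BATCH_SIZE * 100, tags=tags)
        batch_execution_time.observe((time.perf_counter() - start) * 1000, tags=tags)
        return results

    async def __call__(self, request) -> str:
        return await self.handle_batch((await request.body()).decode())
```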
### Latency Breakdown

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Queue Wait Time | ray_serve_queue_wait_time_ms | Time a request spent waiting in the queue before being assigned to a replica | Critical: separate queueing delay from processing delay |
| Replica Queue Length | ray_serve_router_queue_len_guage | The request router's view of each replica's queue length | Will help debug routing imbalances |
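The queue wait metric only requires timestamping requests at enqueue and assignment time; below is a toy sketch of that measurement pattern (the InstrumentedQueue class is hypothetical, not the actual Serve router):

```python
import time
from collections import deque

from ray.util.metrics import Histogram

queue_wait_time = Histogram(
    "serve_queue_wait_time_ms",
    description="Time a request waited in the router queue before assignment (illustrative).",
    boundaries=[1, 5, 10, 50, 100, 500, 1000, 5000],
    tag_keys=("deployment",),
)


class InstrumentedQueue:
    """Toy queue that timestamps requests on enqueue and reports the wait on
    dequeue. Not the real Serve router, just the measurement pattern."""

    def __init__(self, deployment: str):
        self._deployment = deployment
        self._queue = deque()

    def enqueue(self, request) -> None:
        self._queue.append((request, time.perf_counter()))

    def dequeue(self):
        request, enqueued_at = self._queue.popleft()
        queue_wait_time.observe(
            (time.perf_counter() - enqueued_at) * 1000,
            tags={"deployment": self._deployment},
        )
        return request
```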
### Replica Health & Lifecycle

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Replica Startup Latency | ray_serve_replica_startup_latency_ms | Time from replica creation to ready state | Debug slow cold starts; model loading time |
| Replica Initialization Latency | serve_replica_initialization_latency_ms | Time taken to run the replica actor's constructor | |
| Replica Reconfigure Latency | ray_serve_replica_reconfigure_latency_ms | Time for a replica to complete reconfigure | Debug slow reconfiguration; model loading time |
| Health Check Latency | ray_serve_health_check_latency_ms | Duration of health check calls | Identify slow health checks blocking scaling |
| Health Check Failures | ray_serve_health_check_failures_total | Count of failed health checks | Early warning before a replica is marked unhealthy |
| Replica Shutdown Duration | ray_serve_replica_shutdown_duration_ms | Time from shutdown signal to the replica being fully stopped | Debug slow draining during scale-down or rolling updates |
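A sketch of how the health-check metrics could be recorded around the existing health-check call; the timed_health_check wrapper, the check_health RPC name, and the tag keys are assumptions for illustration:

```python
import time

from ray.util.metrics import Counter, Histogram

health_check_latency = Histogram(
    "serve_health_check_latency_ms",
    description="Duration of replica health check calls (illustrative).",
    boundaries=[1, 5, 10, 50, 100, 500, 1000],
    tag_keys=("deployment", "replica"),
)
health_check_failures = Counter(
    "serve_health_check_failures_total",
    description="Count of failed replica health checks (illustrative).",
    tag_keys=("deployment", "replica"),
)


async def timed_health_check(replica_handle, deployment: str, replica: str) -> bool:
    """Hypothetical wrapper around whatever call performs the health check today."""
    tags = {"deployment": deployment, "replica": replica}
    start = time.perf_counter()
    try:
        await replica_handle.check_health.remote()  # assumed health-check RPC
        return True
    except Exception:
        health_check_failures.inc(1, tags=tags)
        return False
    finally:
        health_check_latency.observe((time.perf_counter() - start) * 1000, tags=tags)
```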
### Proxy Health

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Proxy Healthy | ray_serve_proxy_healthy | Total number of healthy proxies in the system. Tags: node_id, node_ip_address | Proxy availability |
| Proxy Draining State | ray_serve_proxy_draining | Whether the proxy is draining (1=draining, 0=not). Tags: node_id, node_ip_address | Visibility during rolling updates |
| Routing Stats Delay | ray_serve_routing_stats_delay_ms | Time taken for routing stats to travel from a replica to the controller | Controller performance |
| Proxy Shutdown Duration | ray_serve_proxy_shutdown_duration_ms | Time from shutdown signal to the proxy being fully stopped | Debug slow draining during rolling updates |
### State Timeline

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Deployment Status | ray_serve_deployment_status | Numeric status of the deployment (0=DEPLOY_FAILED, 1=UNHEALTHY, 2=UPDATING, 3=UPSCALING, 4=DOWNSCALING, 5=HEALTHY). Tags: deployment, application | State Timeline visualization; deployment lifecycle debugging |
| Application Status | ray_serve_application_status | Numeric status of the application (0=NOT_STARTED, 1=DEPLOYING, 2=DEPLOY_FAILED, 3=RUNNING, 4=UNHEALTHY, 5=DELETING). Tags: application | State Timeline visualization; application lifecycle debugging |
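To illustrate how the numeric encoding would be produced, here is a minimal sketch of emitting the deployment status as a gauge; the mapping mirrors the table above, while the function name and call site are assumptions:

```python
from ray.util.metrics import Gauge

# Numeric encoding taken from the table above.
DEPLOYMENT_STATUS_CODES = {
    "DEPLOY_FAILED": 0,
    "UNHEALTHY": 1,
    "UPDATING": 2,
    "UPSCALING": 3,
    "DOWNSCALING": 4,
    "HEALTHY": 5,
}

deployment_status_gauge = Gauge(
    "serve_deployment_status",
    description="Numeric status of a deployment (illustrative).",
    tag_keys=("deployment", "application"),
)


def report_deployment_status(deployment: str, application: str, status: str) -> None:
    # Hypothetical hook, called on each status transition the controller observes.
    deployment_status_gauge.set(
        DEPLOYMENT_STATUS_CODES[status],
        tags={"deployment": deployment, "application": application},
    )
```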
### Long Poll

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Long Poll Latency | ray_serve_long_poll_latency_ms | Time for updates to propagate from the controller to clients | Debug slow config propagation; impacts autoscaling response time |
| Long Poll Pending Clients | ray_serve_long_poll_pending_clients | Number of clients waiting for updates, per namespace | Identify backpressure in the notification system |
Looks good, two questions:

- What is the difference between ray_serve_replica_startup_latency_ms and serve_replica_initialization_latency_ms?
- I believe adding shutdown duration metrics for the proxy and controller would be helpful, as we are doing it for the replica (ray_serve_replica_shutdown_duration_ms). Thoughts on it?
> What is the difference between ray_serve_replica_startup_latency_ms and serve_replica_initialization_latency_ms?

ray_serve_replica_startup_latency_ms is the time taken for a node to be provisioned (if one is not already running, e.g. in a VM or on k8s) + time taken for the runtime env to be bootstrapped on the node for the actor (pip, docker image pull, etc.) + time taken for the Ray actor to be scheduled + time taken to run the actor constructor.

serve_replica_initialization_latency_ms is only the time taken to run the actor constructor.
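Put differently (a sketch with made-up timestamps, not actual Serve fields), the startup latency is a superset of the initialization latency:

```python
# Hypothetical timestamps along one replica's startup path (illustration only).
replica_requested_at = 0.0             # controller decides to start the replica
actor_constructor_started_at = 42.0    # node provisioned, runtime env ready, actor scheduled
actor_constructor_finished_at = 65.0   # constructor done, replica ready to serve

# serve_replica_initialization_latency_ms covers only the constructor.
initialization_latency_ms = (actor_constructor_finished_at - actor_constructor_started_at) * 1000

# ray_serve_replica_startup_latency_ms covers provisioning + runtime env + scheduling
# + constructor, so it is always >= the initialization latency.
startup_latency_ms = (actor_constructor_finished_at - replica_requested_at) * 1000
```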
> I believe adding shutdown duration metrics for the proxy and controller would be helpful, as we are doing it for the replica (ray_serve_replica_shutdown_duration_ms). Thoughts on it?

I think the proxy shutdown duration metric makes sense, will add it.
I think it'd be useful to have more observability into why requests are routed to certain replicas. One metric that'd be useful is the request router's view of each replica's cached queue length.
@akyang-anyscale good idea, added ray_serve_replica_queue_len_guage. I think handle, deployment, replica, and application as dimensions make sense to me.
For metric names: how about renaming ray_serve_autoscaling_decision_replicas to ray_serve_autoscaling_desired_replicas to make it clearer? And ray_serve_deployment_target_replicas to ray_serve_autoscaling_policy_replicas or ray_serve_autoscaling_decision_replicas to keep it under the ray_serve_autoscaling naming convention?
Would the delay metrics also have the deployment dimension?
For ray_serve_batch_utilization_percent, can we also add ray_serve_actual_batch_size?
What does ray_serve_replica_queue_len_guage do that's different from today's running-requests-per-replica metric? Suggest renaming queue_wait_time_ms to something more specific, like request_routing_delay_ms.
> What does ray_serve_replica_queue_len_guage do that's different from today's running-requests-per-replica metric?

ray_serve_replica_queue_len_guage is the deployment request router's view of the replica, whereas ray_serve_num_ongoing_requests_at_replicas is the replica's own view; if they drift a lot, that is indicative of an issue.
> For ray_serve_batch_utilization_percent, can we also add ray_serve_actual_batch_size?
Ack.
> Would the delay metrics also have the deployment dimension?
Yes
> How about renaming ray_serve_autoscaling_decision_replicas to ray_serve_autoscaling_desired_replicas to make it clearer? And ray_serve_deployment_target_replicas to ray_serve_autoscaling_policy_replicas or ray_serve_autoscaling_decision_replicas to keep it under the ray_serve_autoscaling naming convention?
ray_serve_deployment_target_replicas is agnostic of autoscaling; it will be emitted even when the user controls the replica count directly through num_replicas.

I will rename ray_serve_autoscaling_decision_replicas to ray_serve_autoscaling_desired_replicas. But note that ray_serve_autoscaling_desired_replicas != ray_serve_deployment_target_replicas.
Several of the latency/time metrics, like ray_serve_routing_stats_delay_ms, may be better packaged as histograms instead of what I assume is a _sum counter - it'll ensure accurate support for histogram_quantile() and allow a much clearer understanding of the latency distribution.
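For reference, a sketch of what packaging such a delay as a histogram could look like with ray.util.metrics; the bucket boundaries and tag keys below are placeholders, not a proposed configuration:

```python
from ray.util.metrics import Histogram

# Exporting the delay as a histogram (rather than a plain sum/counter) produces
# _bucket/_sum/_count series, so a query like
#   histogram_quantile(0.99, rate(ray_serve_routing_stats_delay_ms_bucket[5m]))
# reflects the actual latency distribution. The bucket boundaries below are
# placeholders and would need tuning.
routing_stats_delay = Histogram(
    "serve_routing_stats_delay_ms",
    description="Time for routing stats to travel from replica to controller (illustrative).",
    boundaries=[1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000],
    tag_keys=("deployment",),
)


def record_routing_stats_delay(deployment: str, delay_ms: float) -> None:
    routing_stats_delay.observe(delay_ms, tags={"deployment": deployment})
```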