kuberay
config/prometheus: add metrics exporter for workers
Why are these changes needed?
Sample configuration for exporting metrics from Ray cluster workers. This works with autoscaling: newly created worker pods are picked up automatically, and destroyed worker pods are removed from the scrape targets.
The PodMonitor CRD works in a similar way to ServiceMonitor, but instead of targeting Services, it targets Pods directly.
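As a rough sketch of what such a resource looks like (the namespace, label selector, and port name below are illustrative assumptions, not necessarily what this PR ships; adjust them to your RayCluster's pod labels):

```yaml
# Sketch of a PodMonitor for Ray worker pods (Prometheus Operator CRD).
# The selector labels and port name are assumptions for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  namespace: prometheus-system
  labels:
    release: prometheus
spec:
  jobLabel: ray-workers
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      ray.io/node-type: worker
  podMetricsEndpoints:
    - port: metrics
```

Because the selector matches pod labels rather than a Service, the autoscaler can add or remove worker pods freely and Prometheus will track them without any Service changes.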
Prometheus example after applying this manifest:

```
ray_raylet_mem{..., container="ray-head", ...} | ...
...
ray_raylet_mem{..., container="ray-worker", ..., pod="ray-cluster-main-worker-generic-group-h2nhg",...} | ...
...
```
@ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't a lot of value in it. I am trying to learn how you leverage those metrics. /cc @scarlet25151
We currently use the workers' metrics for observability via Grafana panels.
We check:
- Active workers per node group, to detect activity spikes on the Ray cluster
- Scheduling status on Ray workers, for example unscheduled tasks
For example, with the following query:

```
sum(ray_scheduler_unscheduleable_tasks{ray_io_cluster="$RayCluster"}) by (Reason, pod)
```
We can detect waiting-for-resources or plasma memory spikes and then check whether the cause was:
- infrastructure related
- activity related (a spike)
- the client running non-optimized code
Some additional examples of worker metrics we observe:

```
sum(rate(ray_scheduler_failed_worker_startup_total{ray_io_cluster="$RayCluster"}[$__range])) by (Reason, pod)
sum(rate(ray_operation_run_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_operation_queue_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_grpc_server_req_handling_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
ray_object_directory_lookups{ray_io_cluster="$RayCluster"}
```
Ratio metrics:

```
# Ratio of gRPC new / finished requests
sum(rate(ray_grpc_server_req_new_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod) / sum(rate(ray_grpc_server_req_finished_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)

# Ratio of directory object locations added / removed
ray_object_directory_added_locations{ray_io_cluster="$RayCluster"} / ray_object_directory_removed_locations{ray_io_cluster="$RayCluster"}

# Worker memory utilization (%)
(1 - (ray_node_mem_available{container="ray-worker", ray_io_cluster="$RayCluster"} / ray_node_mem_total{container="ray-worker", ray_io_cluster="$RayCluster"})) * 100
```
Availability metrics:

```
# [99.9] Percentile of worker register latency (for our cluster, this falls within the 10s bucket)
100 * (sum(rate(ray_worker_register_time_ms_bucket{le="10000.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_worker_register_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

# [99.9] Percentile of process startup latency (for our cluster, this falls within the 100ms bucket)
100 * (sum(rate(ray_process_startup_time_ms_bucket{le="100.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_process_startup_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))
```
@ulfox This is awesome guidance! We export the control plane Grafana dashboard here: https://github.com/ray-project/kuberay/tree/master/config/grafana If the one for workers can be open sourced on your side, I think people would love it.
@Jeffwan I will provide a workers Grafana panel as well!