
config/prometheus: add metrics exporter for workers

ulfox opened this issue 3 years ago

Why are these changes needed?

Sample configuration for exporting metrics from Ray cluster workers. This works with autoscaling, so it should pick up new workers and drop destroyed worker pods as well.

The PodMonitor CRD works in a similar way to ServiceMonitor, but instead of targeting Services, it targets Pods directly.
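The manifest itself isn't reproduced in this thread, so here is a minimal sketch of what such a PodMonitor could look like; the name, namespace, release label, and port name are assumptions that have to match your Prometheus operator and RayCluster setup:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor       # hypothetical name
  namespace: prometheus-system    # assumption: namespace watched by the Prometheus operator
  labels:
    release: prometheus           # assumption: must match the operator's podMonitorSelector
spec:
  jobLabel: ray-workers
  selector:
    matchLabels:
      ray.io/node-type: worker    # label KubeRay sets on worker pods
  podMetricsEndpoints:
    - port: metrics               # assumption: container port name for Ray's metrics export (8080 by default)

Because pods are matched by label rather than listed explicitly, workers created by the autoscaler show up as scrape targets automatically and deleted pods drop out of the target list.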

Prometheus output example after applying this manifest:

ray_raylet_mem{..., container="ray-head", ...} | ...
...
ray_raylet_mem{..., container="ray-worker", ..., pod="ray-cluster-main-worker-generic-group-h2nhg",...} | ...
...

ulfox avatar Aug 13 '22 00:08 ulfox

@ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't a lot of value. I am trying to learn how you leverage those metrics. /cc @scarlet25151

Jeffwan avatar Aug 15 '22 17:08 Jeffwan

@ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't a lot of value. I am trying to learn how you leverage those metrics. /cc @scarlet25151

We currently use the workers' metrics for observability via Grafana panels.

We check:

  • active workers per node group, to detect activity spikes on the Ray cluster
  • scheduling status of Ray workers, for example unscheduled tasks

For example, with the following query:

sum(ray_scheduler_unscheduleable_tasks{ray_io_cluster="$RayCluster"}) by (Reason, pod)

We can detect tasks waiting for resources or plasma memory spikes, and then check whether (see the alert-rule sketch after this list):

  • it was infra related
  • it was activity related (a spike)
  • the client is running non-optimized code
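To make this query actionable outside Grafana, it could also drive an alert. The following PrometheusRule is a hypothetical sketch rather than something from this thread; the threshold, duration, and namespace are assumptions, and the $RayCluster filter is dropped because it is a Grafana template variable that Prometheus rules cannot resolve:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ray-worker-scheduling     # hypothetical name
  namespace: prometheus-system    # assumption: namespace watched by the Prometheus operator
spec:
  groups:
    - name: ray-workers
      rules:
        - alert: RayUnscheduleableTasks
          # Same query as above, minus the Grafana-only $RayCluster variable
          expr: sum(ray_scheduler_unscheduleable_tasks) by (Reason, pod) > 0
          for: 10m                # assumption: tolerate short scheduling delays
          labels:
            severity: warning
          annotations:
            summary: "Tasks on {{ $labels.pod }} unscheduleable for 10m (reason: {{ $labels.Reason }})"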

Some additional examples of worker metrics we observe:

sum(rate(ray_scheduler_failed_worker_startup_total{ray_io_cluster="$RayCluster"}[$__range])) by (Reason, pod)
sum(rate(ray_operation_run_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_operation_queue_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_grpc_server_req_handling_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
ray_object_directory_lookups{ray_io_cluster="$RayCluster"}

Ratio metrics

# Ratio of GRPC new / finished requests
sum(rate(ray_grpc_server_req_new_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod) / sum(rate(ray_grpc_server_req_finished_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)

# Ratio of directory objects added / removed
ray_object_directory_added_locations{ray_io_cluster="$RayCluster"} / ray_object_directory_removed_locations{ray_io_cluster="$RayCluster"}

# Workers memory util
(1 - (ray_node_mem_available{container="ray-worker", ray_io_cluster="$RayCluster"} / ray_node_mem_total{container="ray-worker", ray_io_cluster="$RayCluster"})) * 100

Availability metrics

# [99.9] Percentile of Worker register latency (For our cluster, this is within the 10s bucket)
100 * (sum(rate(ray_worker_register_time_ms_bucket{le="10000.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_worker_register_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

# [99.9] Percentile of Process startup latency (For our cluster, this is within the 100ms bucket)
100 * (sum(rate(ray_process_startup_time_ms_bucket{le="100.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_process_startup_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

ulfox avatar Aug 17 '22 09:08 ulfox

@ulfox This is awesome guidance! We export the control-plane Grafana dashboard here: https://github.com/ray-project/kuberay/tree/master/config/grafana If the one for workers can be open sourced on your side, I think people would love it.

Jeffwan avatar Aug 18 '22 17:08 Jeffwan

@Jeffwan I will provide a workers Grafana panel as well!

ulfox avatar Aug 18 '22 20:08 ulfox