
config/prometheus: add metrics exporter for workers

ulfox opened this issue 3 years ago

Why are these changes needed?

Sample configuration for exporting metrics from Ray cluster workers. This works with autoscaling, so it should pick up new workers and drop destroyed worker pods as well.

The PodMonitor CRD works in a similar way to ServiceMonitor, but instead of targeting Services, it targets Pods directly.
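The manifest itself isn't reproduced in this thread, so here is a minimal sketch of what such a PodMonitor could look like; the name, namespace, release label, and port name are assumptions that have to match your Prometheus operator and RayCluster setup:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor       # hypothetical name
  namespace: prometheus-system    # assumption: namespace watched by the Prometheus operator
  labels:
    release: prometheus           # assumption: must match the operator's podMonitorSelector
spec:
  jobLabel: ray-workers
  selector:
    matchLabels:
      ray.io/node-type: worker    # label KubeRay sets on worker pods
  podMetricsEndpoints:
    - port: metrics               # assumption: container port name for Ray's metrics export (8080 by default)

Because pods are matched by label rather than listed explicitly, workers created by the autoscaler show up as scrape targets automatically and deleted pods drop out of the target list.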

Prometheus output example after applying this manifest:

ray_raylet_mem{..., container="ray-head", ...} | ...
...
ray_raylet_mem{..., container="ray-worker", ..., pod="ray-cluster-main-worker-generic-group-h2nhg",...} | ...
...

ulfox avatar Aug 13 '22 00:08 ulfox

@ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't a lot of value. I am trying to learn how you leverage those metrics. /cc @scarlet25151

Jeffwan avatar Aug 15 '22 17:08 Jeffwan

@ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't a lot of value. I am trying to learn how you leverage those metrics. /cc @scarlet25151

We currently use the workers' metrics for observability via Grafana panels.

We check:

  • active workers per node group, to detect activity spikes on the Ray cluster
  • scheduling status of Ray workers, for example unscheduled tasks

For example, with the following query:

sum(ray_scheduler_unscheduleable_tasks{ray_io_cluster="$RayCluster"}) by (Reason, pod)

We can detect tasks waiting for resources or plasma memory spikes, and then check whether (see the alert-rule sketch after this list):

  • it was infra related
  • it was activity related (a spike)
  • the client is running non-optimized code
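To make this query actionable outside Grafana, it could also drive an alert. The following PrometheusRule is a hypothetical sketch rather than something from this thread; the threshold, duration, and namespace are assumptions, and the $RayCluster filter is dropped because it is a Grafana template variable that Prometheus rules cannot resolve:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ray-worker-scheduling     # hypothetical name
  namespace: prometheus-system    # assumption: namespace watched by the Prometheus operator
spec:
  groups:
    - name: ray-workers
      rules:
        - alert: RayUnscheduleableTasks
          # Same query as above, minus the Grafana-only $RayCluster variable
          expr: sum(ray_scheduler_unscheduleable_tasks) by (Reason, pod) > 0
          for: 10m                # assumption: tolerate short scheduling delays
          labels:
            severity: warning
          annotations:
            summary: "Tasks on {{ $labels.pod }} unscheduleable for 10m (reason: {{ $labels.Reason }})"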

Some additional examples of worker metrics we observe:

sum(rate(ray_scheduler_failed_worker_startup_total{ray_io_cluster="$RayCluster"}[$__range])) by (Reason, pod)
sum(rate(ray_operation_run_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_operation_queue_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_grpc_server_req_handling_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
ray_object_directory_lookups{ray_io_cluster="$RayCluster"}

Ratio metrics

# Ratio of GRPC new / finished requests
sum(rate(ray_grpc_server_req_new_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod) / sum(rate(ray_grpc_server_req_finished_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)

# Ratio of directory objects added / removed
ray_object_directory_added_locations{ray_io_cluster="$RayCluster"} / ray_object_directory_removed_locations{ray_io_cluster="$RayCluster"}

# Workers memory util
(1 - (ray_node_mem_available{container="ray-worker", ray_io_cluster="$RayCluster"} / ray_node_mem_total{container="ray-worker", ray_io_cluster="$RayCluster"})) * 100

Availability metrics

# [99.9] Percentile of Worker register latency (For our cluster, this is within the 10s bucket)
100 * (sum(rate(ray_worker_register_time_ms_bucket{le="10000.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_worker_register_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

# [99.9] Percentile of Process startup latency (For our cluster, this is within the 100ms bucket)
100 * (sum(rate(ray_process_startup_time_ms_bucket{le="100.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_process_startup_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

ulfox avatar Aug 17 '22 09:08 ulfox

@ulfox This is awesome guidance! We export the control-plane Grafana dashboard here: https://github.com/ray-project/kuberay/tree/master/config/grafana If the one for workers can be open sourced on your side, I think people would love it.

Jeffwan avatar Aug 18 '22 17:08 Jeffwan

@Jeffwan I will provide a workers Grafana panel as well!

ulfox avatar Aug 18 '22 20:08 ulfox