[Feature] KubeRay Scalability Benchmarking

Open andrewsykim opened this issue 1 year ago • 3 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

In last week's KubeRay community meeting, we discussed kicking off work to benchmark KubeRay and Ray on different aspects of scalability.

The end result should be something like:

  1. A simple tool that creates Kubernetes clusters and RayCluster resources, then runs benchmarking tests against them
  2. Published benchmark results based on those test runs

As a bonus step, it would be great to set up periodic runs of the scalability tests to catch possible performance regressions.

As a starting point I would like to propose the following metrics to measure:

  • Max total Ray nodes in a single Kubernetes cluster (over X Ray clusters)
  • Time to scale up a Ray cluster to Xk nodes
  • Max RayJob resources (should be on the order of hundreds to a thousand, depending on the size of the job)
  • Some latency / QPS benchmarks for inference with Ray Serve
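
To make the "time to scale up" metric concrete, here is a minimal sketch of how it could be measured. It is only an illustration, not an agreed design: `time_to_scale` and `count_ready_nodes` are hypothetical names, and how ready Ray nodes are actually counted (for example, by listing pods through the Kubernetes API) is left to the caller as an assumption.

```python
import time
from typing import Callable, Optional


def time_to_scale(
    count_ready_nodes: Callable[[], int],
    target_nodes: int,
    timeout_s: float = 3600.0,
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the cluster reports at least `target_nodes` ready nodes.

    `count_ready_nodes` is a hypothetical caller-supplied function; this
    sketch does not assume any particular way of querying the cluster.
    Returns the elapsed seconds, or None if `timeout_s` passes first.
    `clock` and `sleep` are injectable so the loop can be tested without
    real waiting.
    """
    start = clock()
    while True:
        # Check readiness first so an already-scaled cluster reports ~0s.
        if count_ready_nodes() >= target_nodes:
            return clock() - start
        # Give up once the timeout has been reached.
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

The measured value is only as precise as `poll_interval_s`, so a real benchmark would probably want a short interval relative to the expected scale-up time, or event timestamps from the cluster instead of polling.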

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Apr 05 '24 18:04 andrewsykim

For reference, some work has already been done in this area, but it focuses primarily on memory scalability: https://docs.ray.io/en/latest/cluster/kubernetes/benchmarks/memory-scalability-benchmark.html#kuberay-mem-scalability

Apr 05 '24 19:04 andrewsykim

cc @morhidi

Apr 05 '24 21:04 kevin85421

This week is Google Cloud NEXT, but @kevin85421, @morhidi, and I plan to meet sometime next week to kick off this work.

If you have any ideas or feedback on what areas of scalability you would like us to test, please leave a note in this issue.

Apr 11 '24 13:04 andrewsykim