kuberay
[Feature] KubeRay Scalability Benchmarking
Search before asking
- [X] I have searched the issues and found no similar feature request.
Description
In last week's KubeRay community meeting, we discussed kicking off work to benchmark KubeRay and Ray on several aspects of scalability.
The end result should be something like:
- A simple tool that creates Kubernetes clusters and RayClusters and runs benchmarking tests
- Published benchmark results based on the tests run
As a bonus step, it would be great to set up periodic runs of the scalability tests to catch possible performance regressions.
As a starting point I would like to propose the following metrics to measure:
- Max total Ray nodes in a single Kubernetes cluster (over X Ray clusters)
- Time to scale up a Ray cluster to Xk nodes
- Max RayJob resources (should be on the order of hundreds to a thousand, depending on the size of the job)
- Some latency / QPS benchmarks for inference with Ray Serve
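To make two of these metrics concrete, here is a minimal Python sketch of how a benchmark harness might record time-to-scale and summarize latency / QPS. It is only an illustration, not an existing KubeRay tool: `time_to_scale`, `summarize_latencies`, and the `count_ready` callable are hypothetical names. `count_ready` is assumed to return the number of ready Ray nodes (for example, by counting Running pods for the RayCluster through the Kubernetes API).

```python
import time
from typing import Callable, Dict, Optional, Sequence


def time_to_scale(
    count_ready: Callable[[], int],  # hypothetical: returns current number of ready Ray nodes
    target: int,
    timeout_s: float = 600.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until `target` nodes are ready; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if count_ready() >= target:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    n = len(sorted_values)
    rank = max(1, -(-pct * n // 100))  # ceil(pct * n / 100) using integer arithmetic
    return sorted_values[rank - 1]


def summarize_latencies(latencies: Sequence[float], wall_time_s: float) -> Dict[str, float]:
    """Summarize per-request latencies collected over `wall_time_s` seconds."""
    ordered = sorted(latencies)
    return {
        "qps": len(ordered) / wall_time_s,
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
    }
```

The clock and sleep functions are injectable so the polling logic can be unit tested without a real cluster. The measured scale-up time is only as precise as `poll_interval_s`, so a shorter interval gives finer resolution at the cost of more API calls.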
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
For reference, some work has already been done in this area, but it focuses primarily on memory scalability: https://docs.ray.io/en/latest/cluster/kubernetes/benchmarks/memory-scalability-benchmark.html#kuberay-mem-scalability
cc @morhidi
This week is Google Cloud NEXT, but @kevin85421, @morhidi and I plan to meet some time next week to kick off this work.
If you have any ideas or feedback on what areas of scalability you would like us to test, please leave a note in this issue.