Support scalability tests
Kubebench should be extended to support scalability tests in two ways:
- run a job with many workers
- run many jobs in parallel
It should also be able to collect metrics during the runs and make it easy to analyze the results; refer to #124 for this.
Also related to https://github.com/kubeflow/tf-operator/issues/830
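For illustration, a rough sketch of what the "many jobs in parallel" case could look like from the driver side, assuming we simply submit N small TFJob CRs with the Kubernetes Python client; the apiVersion, namespace, image, and job template here are placeholders, not the actual kubebench workload:

```python
# Sketch only: submit N small TFJobs to stress tf-operator.
# Assumes the `kubernetes` Python client and a cluster with tf-operator installed.
# The apiVersion, namespace, and container are placeholders for illustration.
from kubernetes import client, config


def make_tfjob(name, workers):
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": "kubebench"},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": "busybox",          # placeholder workload
                                "command": ["sleep", "60"],
                            }],
                            "restartPolicy": "OnFailure",
                        },
                    },
                },
            },
        },
    }


def submit_jobs(n_jobs, workers_per_job):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    for i in range(n_jobs):
        api.create_namespaced_custom_object(
            group="kubeflow.org",
            version="v1beta1",
            namespace="kubebench",
            plural="tfjobs",
            body=make_tfjob("scale-test-{}".format(i), workers_per_job),
        )


if __name__ == "__main__":
    submit_jobs(100, 1)    # many jobs in parallel
    # submit_jobs(1, 100)  # or one job with many workers
```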
/priority p1
/priority p2
No eng resource assigned, priority 2
@xyhuang are you planning to take this on? If so, can you assign yourself and tag us so that PMs can add this to the kanban board?
/remove-priority p1
@chrisheecho sorry for the late response. I will try to implement this if I get time, but we will likely move it to 0.5. I agree to keep it as p2 for now.
@xyhuang @swiftdiaries
Let's try to have this for 0.5.
A few things to consider:
1. How should we automate this? I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits.
2. What tests should we run? As a starting point we can consider these items for tf-operator:
   - Lots of workers: https://github.com/kubeflow/tf-operator/issues/830
   - Lots of concurrent jobs: https://github.com/kubeflow/tf-operator/issues/829
   What would it take for kubebench to support these?
3. Collecting metrics (a rough sketch of one way to export these follows this list):
   - Error rates
   - Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
   - Throughput (how many jobs/workers can we run concurrently?)
   - CPU/memory usage of the operator
4. Dashboard: we currently have a kubebench-dashboard. Can we use it to track load test results?
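On the metrics point, a minimal sketch of how a load-test driver could export error rate and start latency for Prometheus to scrape (throughput can then be derived with rate() queries); the metric names, port, and the `record_job` hook are made up for illustration:

```python
# Sketch only: expose load-test metrics for Prometheus to scrape.
# Assumes the `prometheus_client` library; metric names and port are illustrative.
import time

from prometheus_client import Counter, Histogram, start_http_server

JOBS_SUBMITTED = Counter(
    "kubebench_scale_jobs_submitted_total",
    "Jobs submitted by the scale test")
JOB_ERRORS = Counter(
    "kubebench_scale_job_errors_total",
    "Jobs that failed or were rejected")
JOB_START_LATENCY = Histogram(
    "kubebench_scale_job_start_latency_seconds",
    "Time from submission until the first worker starts")


def record_job(submit_time, start_time, failed):
    """Record one job's outcome; a (hypothetical) driver loop would call this."""
    JOBS_SUBMITTED.inc()
    if failed:
        JOB_ERRORS.inc()
    else:
        JOB_START_LATENCY.observe(start_time - submit_time)


if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus
    while True:
        time.sleep(60)       # a real driver would submit and watch jobs here
```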
@richardsliu here are some quick answers, we will discuss in more detail:
1 & 2: agree, I will create a few issues to track the required changes; they should be doable within 0.5.
3: today benchmark metrics/results are collected in two ways: (1) sending run-time metrics through monitoring infra (e.g. Prometheus); (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs from TF jobs. If the required info can be collected in one of these ways it should be easy, otherwise we will figure it out (rough postjob sketch below).
4: possible with some changes. Currently we use an on-prem backend for result storage/viz; for tracking results in Prow tests it's probably easier to leverage gcloud resources (Bigtable? Stackdriver?), and that should be supported with a few small changes.
That being said, here are the things needed in my mind:
- create a Prow workflow (for 1)
- add support for benchmarking concurrent jobs in a single kubebench job (for 2)
- minor workload improvements (yaml configs + minor code changes if needed) for the actual tests (for 2)
- support collecting the required metrics (for 3)
- leverage a gcloud backend for result storage and viz (for 4); rough sketch below
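On that last item, a minimal sketch of what pushing a run's summary to a gcloud backend could look like, assuming we go with BigQuery via the `google-cloud-bigquery` client and an existing dataset/table; the table ID, run ID, and fields are placeholders:

```python
# Sketch only: upload one run's summary to BigQuery for later queries/dashboards.
# Assumes `google-cloud-bigquery` and an existing table; all names are placeholders.
import datetime
import json

from google.cloud import bigquery

TABLE_ID = "my-gcp-project.kubebench.scale_test_results"  # placeholder


def upload_summary(summary_path, run_id):
    client = bigquery.Client()
    with open(summary_path) as f:
        summary = json.load(f)
    row = {
        "run_id": run_id,
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "workers_reporting": summary.get("workers_reporting"),
        "total_images_per_sec": summary.get("total_images_per_sec"),
    }
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError("BigQuery insert failed: %s" % errors)


if __name__ == "__main__":
    upload_summary("/kubebench/output/summary.json", run_id="scale-test-001")
```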
Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).