Support scalability tests
Kubebench should be extended to support scalability tests in two ways:
- run a job with many workers
- run many jobs in parallel
It should also be able to collect metrics during the runs and make it easy to analyze the results; refer to #124 for this.
Also related to https://github.com/kubeflow/tf-operator/issues/830
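For illustration, a rough sketch of what the "many jobs in parallel" case could look like from the driver side, assuming we simply submit N small TFJob CRs with the Kubernetes Python client; the apiVersion, namespace, image, and job template here are placeholders, not the actual kubebench workload:

```python
# Sketch only: submit N small TFJobs to stress tf-operator.
# Assumes the `kubernetes` Python client and a cluster with tf-operator installed.
# The apiVersion, namespace, and container are placeholders for illustration.
from kubernetes import client, config


def make_tfjob(name, workers):
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": "kubebench"},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": "busybox",          # placeholder workload
                                "command": ["sleep", "60"],
                            }],
                            "restartPolicy": "OnFailure",
                        },
                    },
                },
            },
        },
    }


def submit_jobs(n_jobs, workers_per_job):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    for i in range(n_jobs):
        api.create_namespaced_custom_object(
            group="kubeflow.org",
            version="v1beta1",
            namespace="kubebench",
            plural="tfjobs",
            body=make_tfjob("scale-test-{}".format(i), workers_per_job),
        )


if __name__ == "__main__":
    submit_jobs(100, 1)    # many jobs in parallel
    # submit_jobs(1, 100)  # or one job with many workers
```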
/priority p1
/priority p2
No eng resource assigned, priority 2
@xyhuang are you planning to take this on? If so, can you assign yourself and tag us so that PMs can add this to the kanban board?
/remove-priority p1
@chrisheecho sorry for the late response. I will try to implement this if I get time, but we will likely move it to 0.5. I agree to keep it as p2 for now.
@xyhuang @swiftdiaries
Let's try to have this for 0.5.
A few things to consider:
1. How should we automate this? I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits.
2. What tests should we run? As a starting point we can consider these items for tf-operator:
   - Lots of workers: https://github.com/kubeflow/tf-operator/issues/830
   - Lots of concurrent jobs: https://github.com/kubeflow/tf-operator/issues/829
   What would it take for kubebench to support these?
3. Collecting metrics (a rough sketch of one way to export these follows this list):
   - Error rates
   - Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
   - Throughput (how many jobs/workers can we run concurrently?)
   - CPU/memory usage of the operator
4. Dashboard: we currently have a kubebench-dashboard. Can we use it to track load test results?
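On the metrics point, a minimal sketch of how a load-test driver could export error rate and start latency for Prometheus to scrape (throughput can then be derived with rate() queries); the metric names, port, and the `record_job` hook are made up for illustration:

```python
# Sketch only: expose load-test metrics for Prometheus to scrape.
# Assumes the `prometheus_client` library; metric names and port are illustrative.
import time

from prometheus_client import Counter, Histogram, start_http_server

JOBS_SUBMITTED = Counter(
    "kubebench_scale_jobs_submitted_total",
    "Jobs submitted by the scale test")
JOB_ERRORS = Counter(
    "kubebench_scale_job_errors_total",
    "Jobs that failed or were rejected")
JOB_START_LATENCY = Histogram(
    "kubebench_scale_job_start_latency_seconds",
    "Time from submission until the first worker starts")


def record_job(submit_time, start_time, failed):
    """Record one job's outcome; a (hypothetical) driver loop would call this."""
    JOBS_SUBMITTED.inc()
    if failed:
        JOB_ERRORS.inc()
    else:
        JOB_START_LATENCY.observe(start_time - submit_time)


if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus
    while True:
        time.sleep(60)       # a real driver would submit and watch jobs here
```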
@richardsliu here are some quick answers, we will discuss in more detail:
1 & 2: agree, I will create a few issues to track the required changes; they should be doable within 0.5.
3: today benchmark metrics/results are collected in two ways: (1) sending run-time metrics through monitoring infra (e.g. Prometheus); (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs from TF jobs. If the required info can be collected in one of these ways it should be easy, otherwise we will figure it out (rough postjob sketch below).
4: possible with some changes. Currently we use an on-prem backend for result storage/viz; for tracking results in Prow tests it's probably easier to leverage gcloud resources (Bigtable? Stackdriver?), and that should be supported with a few small changes.
That being said, here are the things needed in my mind:
- create a Prow workflow (for 1)
- add support for benchmarking concurrent jobs in a single kubebench job (for 2)
- minor workload improvements (yaml configs + minor code changes if needed) for the actual tests (for 2)
- support collecting the required metrics (for 3)
- leverage a gcloud backend for result storage and viz (for 4); rough sketch below
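On that last item, a minimal sketch of what pushing a run's summary to a gcloud backend could look like, assuming we go with BigQuery via the `google-cloud-bigquery` client and an existing dataset/table; the table ID, run ID, and fields are placeholders:

```python
# Sketch only: upload one run's summary to BigQuery for later queries/dashboards.
# Assumes `google-cloud-bigquery` and an existing table; all names are placeholders.
import datetime
import json

from google.cloud import bigquery

TABLE_ID = "my-gcp-project.kubebench.scale_test_results"  # placeholder


def upload_summary(summary_path, run_id):
    client = bigquery.Client()
    with open(summary_path) as f:
        summary = json.load(f)
    row = {
        "run_id": run_id,
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "workers_reporting": summary.get("workers_reporting"),
        "total_images_per_sec": summary.get("total_images_per_sec"),
    }
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError("BigQuery insert failed: %s" % errors)


if __name__ == "__main__":
    upload_summary("/kubebench/output/summary.json", run_id="scale-test-001")
```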
Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).