
Support scalability tests


Kubebench should be extended to support scalability tests in two ways:

  • run a job with many workers
  • run many jobs in parallel

It should also collect metrics during the runs and make it easy to analyze the results; see #124 for this.

Also related to https://github.com/kubeflow/tf-operator/issues/830
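
For illustration, the first mode could be driven by a small script that creates one TFJob with a parameterized worker count. This is only a sketch; the TFJob API version, image, and namespace are placeholders:

```python
# Sketch of the "one job, many workers" mode: create a single TFJob whose
# worker count is a parameter of the benchmark. Assumes the kubernetes Python
# client; the TFJob API version, image, and namespace are placeholders.
from kubernetes import client, config

def launch_many_worker_job(name, num_workers, namespace="kubeflow"):
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    api = client.CustomObjectsApi()
    tfjob = {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {"tfReplicaSpecs": {"Worker": {
            "replicas": num_workers,
            "template": {"spec": {"containers": [{
                "name": "tensorflow",
                "image": "gcr.io/my-project/benchmark:latest",  # placeholder image
            }]}},
        }}},
    }
    api.create_namespaced_custom_object(
        "kubeflow.org", "v1beta1", namespace, "tfjobs", tfjob)

launch_many_worker_job("scale-test-many-workers", num_workers=500)
```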

xyhuang, Oct 09 '18

/priority p1

xyhuang, Oct 09 '18

/priority p2

jbottum, Nov 04 '18

No eng resource assigned, priority 2

@xyhuang are you planning to take this on? If so, please assign yourself and tag us so that the PMs can add this to the kanban board.

chrisheecho, Nov 06 '18

/remove-priority p1

chrisheecho, Nov 06 '18

@chrisheecho sorry for the late response. I will try to implement this if I have time, but we will likely move it to 0.5. I agree with keeping it as p2 for now.

xyhuang, Nov 17 '18

@xyhuang @swiftdiaries

Let's try to have this for 0.5.

A few things to consider:

  1. How should we automate this? I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits.

  2. What tests should we run? As a starting point we can consider these items for tf-operator:

  • Lots of workers: https://github.com/kubeflow/tf-operator/issues/830
  • Lots of concurrent jobs: https://github.com/kubeflow/tf-operator/issues/829
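
For the concurrent-jobs case, the test driver could be as simple as submitting many small TFJobs in parallel and counting submission failures. A rough sketch only; the job count, image, namespace, and TFJob version are placeholders:

```python
# Sketch of the "lots of concurrent jobs" test: submit N one-worker TFJobs in
# parallel and count submission failures. Assumes the kubernetes Python client;
# the TFJob API version, image, namespace, and job count are placeholders.
from concurrent.futures import ThreadPoolExecutor
from kubernetes import client, config

def submit_tfjob(name, namespace="kubeflow"):
    api = client.CustomObjectsApi()
    body = {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {"tfReplicaSpecs": {"Worker": {
            "replicas": 1,
            "template": {"spec": {"containers": [{
                "name": "tensorflow",
                "image": "gcr.io/my-project/benchmark:latest",  # placeholder image
            }]}},
        }}},
    }
    api.create_namespaced_custom_object(
        "kubeflow.org", "v1beta1", namespace, "tfjobs", body)

config.load_kube_config()
errors = 0
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(submit_tfjob, f"scale-test-{i}") for i in range(200)]
    for f in futures:
        try:
            f.result()
        except Exception as exc:
            errors += 1
            print(f"submission failed: {exc}")
print(f"{200 - errors}/200 jobs submitted")
```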

What would it take for kubebench to support these?

  3. Collecting metrics
  • Error rates
  • Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
  • Throughput (how many jobs/workers can we run concurrently?)
  • CPU/memory usage of the operator
  4. Dashboard. We currently have a kubebench-dashboard; can we use it to track load test results?
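
On the metrics side (item 3), start latency and error rates could be measured by polling each TFJob's status until it reports a running or terminal condition. Again only a sketch, assuming the kubernetes Python client; job names and condition types are illustrative:

```python
# Sketch of the latency/error-rate measurement: poll each submitted TFJob's
# status and record the time until it reports Running or Succeeded. Assumes
# the kubernetes Python client; job names and condition types are illustrative.
import time
from kubernetes import client, config

def seconds_until_started(api, name, namespace="kubeflow", timeout=600):
    start = time.time()
    while time.time() - start < timeout:
        job = api.get_namespaced_custom_object(
            "kubeflow.org", "v1beta1", namespace, "tfjobs", name)
        for cond in job.get("status", {}).get("conditions", []):
            if cond.get("type") in ("Running", "Succeeded") and cond.get("status") == "True":
                return time.time() - start   # seconds until the job started
            if cond.get("type") == "Failed" and cond.get("status") == "True":
                return None                  # failed job counts as an error
        time.sleep(5)
    return None                              # timeout also counts as an error

config.load_kube_config()
api = client.CustomObjectsApi()
latencies = [seconds_until_started(api, f"scale-test-{i}") for i in range(200)]
errors = latencies.count(None)
started = [l for l in latencies if l is not None]
print(f"errors: {errors}, mean start latency: {sum(started) / max(len(started), 1):.1f}s")
```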

richardsliu, Jan 08 '19

@richardsliu here are some quick answers; we will discuss in more detail:

1 & 2 agree, i will create a few issues to track required changes, they should be doable within 0.5. 3 today benchmark metrics/results are collected in 2 ways: (1) sending run-time metrics through monitoring infra (e.g. prometheus) (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs from tf jobs. if required info can be collected in one of these ways it should be easy, else we will figure it out. 4 possible with some changes. currently we use a on-prem backend for result storage/viz, for tracking results in prow tests it's probably easier to leverage gcloud resources (bigtable? stackdriver?), that should be supported with a few small changes.

That being said, here is what I think is needed:

  • create a Prow workflow (for 1)
  • add support for benchmarking concurrent jobs in a single kubebench job (for 2)
  • minor workload improvements (yaml configs + minor code changes if needed) for the actual tests (for 2)
  • support collecting the required metrics (for 3)
  • leverage a gcloud backend for result storage and viz (for 4)
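
For the last item, one possible direction (a sketch only, nothing decided) is streaming per-run results into BigQuery, which the dashboard or a Prow step could then read; the project, dataset, table, and schema below are placeholders:

```python
# Sketch of the "gcloud backend for result storage" item: stream one row of
# benchmark results into BigQuery, which a dashboard or a Prow step could read.
# Assumes the google-cloud-bigquery client; project, dataset, table, and schema
# are placeholders, not an agreed design.
from datetime import datetime, timezone
from google.cloud import bigquery

def record_run(num_jobs, errors, mean_start_latency_s):
    bq = bigquery.Client(project="kubeflow-scale-tests")  # placeholder project
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "num_jobs": num_jobs,
        "errors": errors,
        "mean_start_latency_s": mean_start_latency_s,
    }
    # The table must already exist with a matching schema.
    failed = bq.insert_rows_json("kubeflow-scale-tests.kubebench.scale_runs", [row])
    if failed:
        raise RuntimeError("BigQuery insert failed: %s" % failed)

record_run(num_jobs=200, errors=3, mean_start_latency_s=42.5)
```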

xyhuang, Jan 08 '19

Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).

richardsliu, Jan 08 '19