
Add additional cases for stress tests

Open zpatrick opened this issue 8 years ago • 4 comments

  • We should increase the total number of tasks created from 100 to 500. Possibly 1000.
  • Tasks should take a random amount of time to complete, probably between 30 seconds and 5 minutes.
  • We should make the create task requests async, to simulate multiple sources hitting the api.
  • Instead of creating all the tasks right at the start, we should create tasks in waves. This would simulate a more realistic scenario and would ensure the environment scaler runs properly over a long period of stress.
  • [ ] Document actual or perceived limits in a LIMITS file
  • [x] Investigate and then determine if Go benchmarking is appropriate for the task
  • [x] Parameterize the benchmarking so that an environment can be created by specifying the number of environments, deploys, services and load balancers
  • [x] Layer0 configurations must be stood up using Terraform

zpatrick avatar May 09 '17 18:05 zpatrick

  • Investigate if Go benchmarking is useful in this context https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

diemonster avatar Aug 07 '17 19:08 diemonster

As per design meeting, it would be cool if the tests had the following stages:

  • Provisioning stage (create X environments, services, etc.)
  • Benchmarking stage (given X environments and services, how quickly does the Layer0 API respond?)

diemonster avatar Aug 08 '17 18:08 diemonster

I picked out some values for number of environments, number of deploys, and number of services to see if I could break things, and I did.

https://gist.github.com/jparsons04/14576f5633b6e8b734f35e186e9cee4d

I ultimately interrupted the process so that I could clean up what I created.

Note that Layer0 wouldn't let me run l0 environment delete foo because:

$ l0 environment list
AWS Error: clusters cannot have more than 100 elements (code 'InvalidParameterException')

So I had to clean up the environments by hand in ECS until the total number of environments fell below 100.

jparsons04 avatar Aug 15 '17 15:08 jparsons04

I modified the benchmark functions so that they construct more modest infrastructure: https://github.com/quintilesims/layer0/blob/232-jparsons/tests/stress/benchmarkstress_test.go

Here are the results of these tests I ran earlier today: https://gist.github.com/jparsons04/ff0c8f5690c7f686862821a0aa0c0f82

One thing to note here is that running this set of tests once created 300 ECS Task Definitions that are still considered to be in an active state: https://gist.github.com/jparsons04/e47348ddadbf62329c08a30614895618

From what I observed during this test run, it seems that l0 service list can fail if there is a sufficiently large number of active Task Definitions in scope.

It seems that if you have a large number of Task Definitions within the scope of your l0 instance, you can run into problems even standing up modest infrastructures. Here is a snippet of a run I did earlier this morning when I had about 1900 Task Definitions in an active state:

$ go test -v -debug -run nothing -bench . -timeout 5h
2017-08-15 11:32:24 [Test] DEBUG: Testing with Environments: 5, Deploys: 0, Services: 0, Command: while true ; do echo LONG RUNNING SERVICE ; sleep 5 ; done
2017-08-15 11:32:24 [Test] INFO : Running [terraform get] from cases/modules
2017-08-15 11:32:24 [Test] INFO : Running [terraform apply] from cases/modules
BenchmarkStress5Environments0Deploys0Services/ListEnvironments-8                    3     349042068 ns/op
BenchmarkStress5Environments0Deploys0Services/ListDeploys-8                         0             0 ns/op
BenchmarkStress5Environments0Deploys0Services/ListLoadBalancers-8                   5     469146050 ns/op
BenchmarkStress5Environments0Deploys0Services/ListServices-8                       10     230668960 ns/op
2017-08-15 11:34:34 [Test] INFO : Running [terraform destroy -force] from cases/modules
--- FAIL: BenchmarkStress5Environments0Deploys0Services
    layer0_test_client.go:122: EOF
    layer0_test_client.go:122: EOF
2017-08-15 11:36:41 [Test] DEBUG: Testing with Environments: 10, Deploys: 0, Services: 0, Command: while true ; do echo LONG RUNNING SERVICE ; sleep 5 ; done
2017-08-15 11:36:41 [Test] INFO : Running [terraform get] from cases/modules
2017-08-15 11:36:41 [Test] INFO : Running [terraform apply] from cases/modules
BenchmarkStress10Environments0Deploys0Services/ListEnvironments-8                   2     754963037 ns/op

That being said, the first gist I posted in this comment represents the first successful baseline run I've had while stress testing with the provisioning-then-benchmarking pattern.

jparsons04 avatar Aug 16 '17 00:08 jparsons04