Add additional cases for stress tests
- We should increase the total number of tasks created from 100 to 500, possibly 1000.
- Tasks should take some random amount of time to complete, probably between 30 seconds and 5 minutes.
- We should make the create task requests async, to simulate multiple sources hitting the API.
- Instead of creating all the tasks right at the start, we should create tasks in waves. This would simulate a more realistic scenario, and would ensure the environment scaler runs properly over a long period of stress.
- [ ] Document actual or perceived limits in a LIMITS file
- [x] Investigate and then determine if Go benchmarking is appropriate for the task
- [x] Parameterize the benchmarking such that an environment can be created specifying the number of environments, deploys, services and load balancers
- [x] Layer0 configurations must be stood up using Terraform
- Investigate if Go benchmarking is useful in this context https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go
As per the design meeting, it would be cool if the tests had the following stages:
- Provisioning stage (create X amount of environments, services, etc)
- Benchmarking stage (given X amount of environments, services, how quickly does the Layer0 API respond?)
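The two stages could be sketched roughly like this with Go's `testing` package. Everything here is hypothetical: `stressCase`, `provision` and `fakeAPI` are placeholders, and the in-memory `fakeAPI` stands in for a real Layer0 API client backed by Terraform-provisioned resources. Load balancers are left out for brevity.

```go
package main

import (
	"fmt"
	"testing"
)

// stressCase is a hypothetical parameter set for one benchmark run.
type stressCase struct {
	Environments int
	Deploys      int
	Services     int
}

// name builds a benchmark-style name from the parameters.
func (c stressCase) name() string {
	return fmt.Sprintf("%dEnvironments%dDeploys%dServices", c.Environments, c.Deploys, c.Services)
}

// fakeAPI is an in-memory placeholder for the API under test.
type fakeAPI struct {
	environments []string
}

// provision is the provisioning stage: it creates the requested resources
// once, before any timing starts. Only environments are modeled here.
func provision(c stressCase) *fakeAPI {
	api := &fakeAPI{}
	for i := 0; i < c.Environments; i++ {
		api.environments = append(api.environments, fmt.Sprintf("env-%d", i))
	}
	return api
}

// ListEnvironments returns a copy of the provisioned environment names.
func (a *fakeAPI) ListEnvironments() []string {
	out := make([]string, len(a.environments))
	copy(out, a.environments)
	return out
}

// benchmarkStress provisions first, then times each list call as its own
// sub-benchmark so provisioning cost is not included in the measurement.
func benchmarkStress(b *testing.B, c stressCase) {
	api := provision(c)
	b.Run("ListEnvironments", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			api.ListEnvironments()
		}
	})
}

func main() {
	// testing.Init is needed when calling testing.Benchmark outside "go test".
	testing.Init()
	cases := []stressCase{{Environments: 5}, {Environments: 10}}
	for _, c := range cases {
		c := c
		result := testing.Benchmark(func(b *testing.B) { benchmarkStress(b, c) })
		fmt.Printf("BenchmarkStress%s\t%s\n", c.name(), result)
	}
}
```

In a real `_test.go` file the same `benchmarkStress` helper could be called from ordinary `BenchmarkXxx` functions and run with `go test -bench .`.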
I picked out some values for number of environments, number of deploys, and number of services to see if I could break things, and I did.
https://gist.github.com/jparsons04/14576f5633b6e8b734f35e186e9cee4d
I ultimately interrupted the process so that I could clean up what I created.
Note that Layer0 didn't allow me to run `l0 environment delete foo` because:
$ l0 environment list
AWS Error: clusters cannot have more than 100 elements (code 'InvalidParameterException')
So I had to clean up the environments by hand in ECS until the total number of environments fell below 100.
I made some modifications to the benchmark functions that were more modest in what was being constructed: https://github.com/quintilesims/layer0/blob/232-jparsons/tests/stress/benchmarkstress_test.go
Here are the results of these tests I ran earlier today: https://gist.github.com/jparsons04/ff0c8f5690c7f686862821a0aa0c0f82
One thing to note here is that running this set of tests once created 300 ECS Task Definitions that are still considered to be in an active state: https://gist.github.com/jparsons04/e47348ddadbf62329c08a30614895618
Based on what I observed during this test run, `l0 service list` can fail when there is a sufficiently large number of active Task Definitions in scope.
It seems that if you have a large number of Task Definitions within the scope of your l0 instance, you can run into problems even standing up modest infrastructures. Here is a snippet of a run I did earlier this morning when I had about 1900 Task Definitions in an active state:
$ go test -v -debug -run nothing -bench . -timeout 5h
2017-08-15 11:32:24 [Test] DEBUG: Testing with Environments: 5, Deploys: 0, Services: 0, Command: while true ; do echo LONG RUNNING SERVICE ; sleep 5 ; done
2017-08-15 11:32:24 [Test] INFO : Running [terraform get] from cases/modules
2017-08-15 11:32:24 [Test] INFO : Running [terraform apply] from cases/modules
BenchmarkStress5Environments0Deploys0Services/ListEnvironments-8 3 349042068 ns/op
BenchmarkStress5Environments0Deploys0Services/ListDeploys-8 0 0 ns/op
BenchmarkStress5Environments0Deploys0Services/ListLoadBalancers-8 5 469146050 ns/op
BenchmarkStress5Environments0Deploys0Services/ListServices-8 10 230668960 ns/op
2017-08-15 11:34:34 [Test] INFO : Running [terraform destroy -force] from cases/modules
--- FAIL: BenchmarkStress5Environments0Deploys0Services
layer0_test_client.go:122: EOF
layer0_test_client.go:122: EOF
2017-08-15 11:36:41 [Test] DEBUG: Testing with Environments: 10, Deploys: 0, Services: 0, Command: while true ; do echo LONG RUNNING SERVICE ; sleep 5 ; done
2017-08-15 11:36:41 [Test] INFO : Running [terraform get] from cases/modules
2017-08-15 11:36:41 [Test] INFO : Running [terraform apply] from cases/modules
BenchmarkStress10Environments0Deploys0Services/ListEnvironments-8 2 754963037 ns/op
That being said, the first gist I posted in this comment represents the first successful baseline run I've had of the provisioning-then-benchmarking stress test pattern.