layer0 Add additional cases for stress tests

We should increase the total number of tasks created from 100 to 500. Possibly 1000.
Tasks should take some random amount of time to complete. Probably between 30 seconds and 5 minutes.
We should make the create task requests async, to simulate multiple sources hitting the API.
Instead of creating all the tasks right at the start, we should create tasks in waves. This would simulate a more realistic scenario, and would ensure the environment scaler runs properly over a long period of stress.

[ ] Document actual or perceived limits in a LIMITS file
[x] Investigate whether Go benchmarking is appropriate for the task
[x] Parameterize the benchmarking so that a test setup can be created by specifying the number of environments, deploys, services, and load balancers
[x] layer0 configurations must be stood up using Terraform

May 09 '17 18:05 zpatrick

Investigate whether Go benchmarking is useful in this context: https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

Aug 07 '17 19:08 diemonster

As per the design meeting, it would be cool if the tests had the following stages:

Provisioning stage (create X environments, services, etc.)
Benchmarking stage (given X environments and services, how quickly does the Layer0 API respond?)

Aug 08 '17 18:08 diemonster

I picked out some values for number of environments, number of deploys, and number of services to see if I could break things, and I did.

https://gist.github.com/jparsons04/14576f5633b6e8b734f35e186e9cee4d

I ultimately interrupted the process so that I could clean up what I created.

Note that layer0 wouldn't let me run l0 environment delete foo. Listing environments failed with:

$ l0 environment list
AWS Error: clusters cannot have more than 100 elements (code 'InvalidParameterException')

So I had to clean up the environments by hand in ECS until the total number of environments fell below 100.

Aug 15 '17 15:08 jparsons04

I modified the benchmark functions so that they construct more modest infrastructure: https://github.com/quintilesims/layer0/blob/232-jparsons/tests/stress/benchmarkstress_test.go

Here are the results of these tests I ran earlier today: https://gist.github.com/jparsons04/ff0c8f5690c7f686862821a0aa0c0f82

One thing to note here is that running this set of tests once created 300 ECS Task Definitions that are still considered to be in an active state: https://gist.github.com/jparsons04/e47348ddadbf62329c08a30614895618

Based on what I observed during this test run, l0 service list can fail when a sufficiently large number of active Task Definitions is in scope.

More generally, it seems that with a large number of Task Definitions within the scope of your l0 instance, you can run into problems even when standing up modest infrastructure. Here is a snippet of a run I did earlier this morning when I had about 1900 Task Definitions in an active state:

$ go test -v -debug -run nothing -bench . -timeout 5h
2017-08-15 11:32:24 [Test] DEBUG: Testing with Environments: 5, Deploys: 0, Services: 0, Command: while true ; do echo LONG RUNNING SERVICE ; sleep 5 ; done
2017-08-15 11:32:24 [Test] INFO : Running [terraform get] from cases/modules
2017-08-15 11:32:24 [Test] INFO : Running [terraform apply] from cases/modules
BenchmarkStress5Environments0Deploys0Services/ListEnvironments-8                    3     349042068 ns/op
BenchmarkStress5Environments0Deploys0Services/ListDeploys-8                         0             0 ns/op
BenchmarkStress5Environments0Deploys0Services/ListLoadBalancers-8                   5     469146050 ns/op
BenchmarkStress5Environments0Deploys0Services/ListServices-8                       10     230668960 ns/op
2017-08-15 11:34:34 [Test] INFO : Running [terraform destroy -force] from cases/modules
--- FAIL: BenchmarkStress5Environments0Deploys0Services
    layer0_test_client.go:122: EOF
    layer0_test_client.go:122: EOF
2017-08-15 11:36:41 [Test] DEBUG: Testing with Environments: 10, Deploys: 0, Services: 0, Command: while true ; do echo LONG RUNNING SERVICE ; sleep 5 ; done
2017-08-15 11:36:41 [Test] INFO : Running [terraform get] from cases/modules
2017-08-15 11:36:41 [Test] INFO : Running [terraform apply] from cases/modules
BenchmarkStress10Environments0Deploys0Services/ListEnvironments-8                   2     754963037 ns/op

That being said, the first gist I posted in this comment is the first successful baseline run I've had while stress testing with the provisioning-then-benchmarking pattern.

Aug 16 '17 00:08 jparsons04