cluster-api-provider-bringyourownhost
Investigate how fast can we get all the e2e tests to run
Describe the solution you'd like
At the moment, running all the e2e tests takes about 30 minutes. Can we investigate what the bottleneck is, and whether it is possible to speed this up?
This is where all our e2e tests live - https://github.com/vmware-tanzu/cluster-api-provider-bringyourownhost/tree/main/test/e2e
This issue involves doing some research on what steps in the tests are consuming a lot of time and if there are ways to mitigate this.
Good to start with the quickstart tests - https://github.com/vmware-tanzu/cluster-api-provider-bringyourownhost/blob/main/test/e2e/e2e_test.go
- [ ] identify how much time each step takes (test setup, creating management cluster, docker hosts, apply cluster template, log collection, teardown)
- [ ] identify possible improvements
- [ ] the team can then review what improvements can be done in a reasonable time
- [ ] implement!
This would be an interesting investigation! 😄
Personally, I have seen better speed when using more CPU and RAM. This is based on experience from TCE when running E2E tests for Docker clusters (CAPD). For other clusters such as AWS, Azure, and VMC (vSphere), the speed depended on the kind of machines we were spinning up on those platforms, and only a small amount of resources (CPU and RAM) was needed from the host machine running the tanzu CLI, where the kind bootstrap cluster had to run on top of Docker.
Also, one tricky thing: the Docker runtime usually has access to all resources of the host machine, but that need not be the case, so that's something to check too. On dev machines, many devs allocate only part of their host resources to the Docker runtime. In CI/CD environments with container support, such as GitHub Actions, the VMs usually provide full resources to the Docker runtime, but it's worth confirming this during the investigation rather than assuming it.
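One way to confirm this is to compare what Docker reports against what the machine has. The sketch below assumes the `docker` CLI is on the PATH and uses the `NCPU` and `MemTotal` fields of `docker info`; `parseDockerInfo` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"strconv"
	"strings"
)

// parseDockerInfo parses "<ncpu> <memBytes>", the output of
// docker info --format '{{.NCPU}} {{.MemTotal}}'.
func parseDockerInfo(out string) (cpus int, memBytes int64, err error) {
	fields := strings.Fields(out)
	if len(fields) != 2 {
		return 0, 0, fmt.Errorf("unexpected docker info output: %q", out)
	}
	if cpus, err = strconv.Atoi(fields[0]); err != nil {
		return 0, 0, err
	}
	if memBytes, err = strconv.ParseInt(fields[1], 10, 64); err != nil {
		return 0, 0, err
	}
	return cpus, memBytes, nil
}

func main() {
	out, err := exec.Command("docker", "info", "--format", "{{.NCPU}} {{.MemTotal}}").Output()
	if err != nil {
		fmt.Println("could not query docker:", err)
		return
	}
	cpus, mem, err := parseDockerInfo(string(out))
	if err != nil {
		fmt.Println("could not parse docker info:", err)
		return
	}
	// runtime.NumCPU reports the CPUs visible to this process, which on a
	// dev machine with a Docker VM can be more than Docker itself gets.
	fmt.Printf("docker: %d CPUs, %.1f GiB RAM; this process sees %d CPUs\n",
		cpus, float64(mem)/(1<<30), runtime.NumCPU())
}
```

Running this once on a dev machine and once in the CI job would show whether the two environments give Docker comparable resources.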
Building the host agent separately for every container takes time. If we move the build to suit-test.go so that it runs only once for the entire e2e run, we can save some time. My test results are as follows:
Before the change:
- e2e_test: time elapse: 4m38.405165171s
- md_scale_test: time elapse: 11m2.787283875s
- byohost_reuse_test: time elapse: 7m14.318728577s
After the change:
- e2e_test: time elapse: 5m27.469585284s
- md_scale_test: time elapse: 7m26.022867913s
- byohost_reuse_test: time elapse: 6m51.386107789s
The saving is largest for md_scale_test (about 3m37s).
This is done by https://github.com/vmware-tanzu/cluster-api-provider-bringyourownhost/pull/404
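As an illustration of the build-once idea (not necessarily how the PR implements it), here is a minimal sketch using `sync.Once`. `buildAgent`, `agentBinary`, and the binary path are hypothetical; in a Ginkgo suite, a `BeforeSuite` block would be another natural place for a one-time build.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	buildOnce  sync.Once
	agentPath  string
	buildErr   error
	buildCount int // counts real builds, only to demonstrate the effect
)

// buildAgent stands in for the expensive host agent build.
// Hypothetical: the real build step and output path will differ.
func buildAgent() (string, error) {
	buildCount++
	return "/tmp/byoh-hostagent", nil
}

// agentBinary builds the agent on the first call and returns the
// cached result (path or error) on every later call.
func agentBinary() (string, error) {
	buildOnce.Do(func() { agentPath, buildErr = buildAgent() })
	return agentPath, buildErr
}

func main() {
	for _, spec := range []string{"e2e_test", "md_scale_test", "byohost_reuse_test"} {
		path, err := agentBinary()
		if err != nil {
			fmt.Println("build failed:", err)
			return
		}
		fmt.Printf("%s uses %s\n", spec, path)
	}
	fmt.Println("builds:", buildCount)
}
```

Each spec then copies the same prebuilt binary into its containers instead of rebuilding it, which fits the observation that the spec with the most hosts gains the most.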
I added some timing checkpoints and collected some data. The total run took 21m25.592598031s, including 4m31.395224658s for e2e_test and 5m43.749294255s for reuse_test. I'm not sure of the exact value for md_scale_test, because it reported errors when stopping the container at the very end of the test. It should be more than 7m. The detailed data is as follows:
setupBootstrapCluster: 1m8.094087766s
initBootstrapCluster: 1m4.417632025s
e2e_test: Total: 4m31.395224658s
- BeforeEach: time elapse: 45.860970377s
- Creating byohost capacity pool: time elapse: 3.294036855s
- creating a workload cluster: time elapse: 3m52.242673924s
- dumpSpecResourcesAndCleanup: time elapse: 10.610023703s
- clean up byoh container and files: time elapse: 21.341578337s
reuse_test: Total: 5m43.749294255s
- BeforeEach: 40.732290086s
- Creating byohost capacity pool: 2.968998621s
- Creating a cluster: 2m51.500562216s
- Delete the cluster and freeing the ByoHosts: 10.060083121s
- Creating a new cluster: 2m0.959306317s
- dumpSpecResourcesAndCleanup: 10.426351336s
- clean up byoh container and files: 21.062230394s
md_scale_test: Total: (no value, because it reported errors when stopping the container)
- BeforeEach: 40.594427525s
- Creating byohost capacity pool: 7.89641687s
- Creating a workload cluster: 6m1.652912815s
- Scaling the MachineDeployment out to 3: 50.432745806s
- Scaling the MachineDeployment out to 3: 10.14809982s
- dumpSpecResourcesAndCleanup: 10.529640877s