Speed up CI
It currently takes ~1 hour to push a PR through CI. Roughly, this breaks down as:
| Action | Stage | Avg. Time |
|---|---|---|
| VM provisioning | metrics | 1-10 min |
| VM startup | metrics | 0.5 min |
| Container build (each) | metrics | 5-10 min |
| `docker-compose up` | metrics | 1-5 min |
| init job wait | metrics | 2-20 min |
| Smoke tests | metrics | 1 min |
| Tempest tests | metrics | 10 min |
| VM provisioning | logs | 1-10 min |
| VM startup | logs | 0.5 min |
| Container build (each) | logs | 5-10 min |
| `docker-compose up` | logs | 1-5 min |
| init job wait | logs | 2-20 min |
Guesstimates:
- Best case: 30 minutes
- Worst case: 1 hour 42 minutes (though probably more like 60 minutes)
Unfortunately, neither Travis nor GitHub seems to record the real-world time jobs actually take (including provisioning delays), so I had to guess a bit. But going through the process of merging a PR (approve, update to latest master, wait on CI, merge) takes at least an hour, sometimes more, since I actually have to remember to check on it later.
Some ideas to speed this up (definitely open to suggestions):
- Move the Tempest tests to a nightly cron job, automatically filing GitHub issues on failure (as mentioned in #200) - saves 10 minutes from the best case
  - smoke tests are (hopefully?) sufficient for per-patch validation
- Don't use Travis stages so we only incur VM provisioning time once (we can just run `ci.py` twice) - saves up to 10 minutes
- Run multiple container builds in parallel (pass `-w 2` to `dbuild`)
- Decrease the init job's max wait time from 20 minutes to something more reasonable (5?)
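A minimal sketch of the no-stages variant, assuming for illustration that `ci.py` takes a pipeline name as its argument (the real invocation may differ):

```yaml
# Hypothetical single-job .travis.yml: VM provisioning cost is paid once,
# and both pipelines run sequentially on the same VM.
sudo: required
services:
  - docker
script:
  - ./ci.py metrics   # assumed invocation; adjust to ci.py's actual CLI
  - ./ci.py logs
```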
With these changes the new rough times would be:
- Best case: 18.5 min
- Worst case: 51.5 min
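For what it's worth, the guesstimates above can be reproduced from the table, assuming one container build per stage and taking each range's extremes:

```python
# Per-stage timings from the table, in minutes, as (name, best, worst).
# Both the metrics and logs stages share these steps.
SHARED = [
    ("VM provisioning", 1, 10),
    ("VM startup", 0.5, 0.5),
    ("Container build", 5, 10),    # assuming one container build per stage
    ("docker-compose up", 1, 5),
    ("init job wait", 2, 20),
]
METRICS_ONLY = [("Smoke tests", 1, 1), ("Tempest tests", 10, 10)]

best = sum(b for _, b, _ in SHARED) * 2 + sum(b for _, b, _ in METRICS_ONLY)
worst = sum(w for _, _, w in SHARED) * 2 + sum(w for _, _, w in METRICS_ONLY)
print(best, worst)  # 30.0 102.0 -> 30 min best, 1 h 42 min worst

# With the proposed changes: drop Tempest (-10), provision one VM instead of
# two (-1 best / -10 worst, plus -0.5 startup), cap init wait at 5 min
# (worst case: -15 per stage).
new_best = best - 10 - 1 - 0.5
new_worst = worst - 10 - 10 - 0.5 - 15 * 2
print(new_best, new_worst)  # 18.5 51.5
```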
Hmm... I didn't catch that issue, and I have since submitted #227. Perhaps it will help, though.
There's one thing I picked up during my own time spent with `ci.py`. Overall, I'd stay with stages, but I'd try to move to a solution that builds all dirty modules prior to any tests (i.e. in the first stage, or one of the first stages).
The benefit is that we could make the entire CI much more reliable and maintainable. It is starting to grow a bit too large to understand properly and extend. The drawback is that stages are independent, so we would have to push the temporary images to the hub. However, we could also drop them afterwards.
Just to give you a hint, my idea was to have:
- a stage for linting, pretty much @matrixik's activities
- a stage for all the modules that need to be built, regardless of the pipeline
- all the required tests
- finally, publishing the modules (if the PR is merged, or other conditions are met).
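Sketching that layout as a `.travis.yml` (stage names and commands are invented; the real `ci.py` entry points would differ):

```yaml
# Hypothetical staged pipeline for the layout above.
jobs:
  include:
    - stage: lint
      script: ./ci.py lint          # assumed subcommand
    - stage: build
      script: ./ci.py build         # build all dirty modules up front
    - stage: test
      script: ./ci.py test
    - stage: publish
      if: branch = master AND type = push   # only after merge, for example
      script: ./ci.py publish
```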
All that is driven by the fact that recent changes have revealed we've introduced a lot of activities in CI that increased the execution time. My idea might not necessarily result in speeding things up, but I do strongly believe that a simple, or rather modular, setup is something monasca-docker should move toward. Plus, we would gain a little more in the area of logging, as the logs for the stages wouldn't be mixed up together. That is somewhat addressed in #225, but my idea does not necessarily conflict with it; I'd say they complement each other.
Update: there's a possibility to save & load images (`docker save` and `docker load`), so we might do that to save the results of the image build from the possible step 2 for later steps.
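For instance (paths and image names are made up, and this assumes the tarball can actually be handed between stages, e.g. via Travis's cache):

```yaml
# Persist built images at the end of the build-stage job...
before_cache:
  - docker save -o cached/images.tar monasca/api:ci monasca/agent:ci
# ...and restore them at the start of a later stage, if present:
before_install:
  - '[ -f cached/images.tar ] && docker load -i cached/images.tar || true'
cache:
  directories:
    - cached
```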
There's also the Docker integration for GitHub, but I guess that works as if your repo contained a single project that is built into one image. Since this repo contains Dockerfiles for multiple images, I don't think it will work out.
I agree that Travis's stages could make things a lot simpler, but unfortunately they can be very slow. Even during quiet periods it takes 3-5 minutes for our code to actually begin executing in each stage, and when Travis is under heavier load (like yesterday morning) we were waiting more like 10 minutes for a job to start. And that's per stage, so on busy days with 2 stages we can potentially sit around waiting for 20 minutes just to run 15 minutes of tests.
Given human delays in responding to CI results, the turnaround time per patch ends up being about an hour. Practically speaking, this means we can review and merge at most 8 patches in a given work day, even if they're trivial 1-line patches. In practice it's probably more like 5-6, though.
(The startup time is mostly from us needing sudo-capable VMs to use Docker; Travis's container-based builds, like we used in monasca-helm, are much quicker, usually 30-60 seconds.)
Ideally I'd like to see basic CI complete in 15 minutes (if not less) for the majority of jobs. And really, we're very close to that already, at least in terms of the code we actually run ourselves.
A more modular setup is a great idea, and I think we could implement that in a way that doesn't incur overhead for each build "stage". Actually running on a fresh VM (or even container) for each stage doesn't seem beneficial when we can just run tasks sequentially ourselves and incur no extra startup overhead. A side benefit to this is that we don't need to worry about transporting any built images between stages, since they're already present on the local system after being built.
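As a rough sketch of that single-job flow (the step names are invented for illustration; the real `ci.py` entry points would be used instead):

```python
import subprocess

# Hypothetical sequential CI driver: one VM, ordered steps, fail fast.
# Images built in the "build" step stay in the local Docker cache for the
# later test steps, so nothing needs to be transported between "stages".
STEPS = [
    ["./ci.py", "lint"],    # assumed subcommands, for illustration only
    ["./ci.py", "build"],
    ["./ci.py", "test"],
]

def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order; stop at the first failure, like a failing stage."""
    for cmd in steps:
        if runner(cmd).returncode != 0:
            return False
    return True
```

Injecting `runner` keeps the driver testable without actually invoking `ci.py` or Docker.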