Speed up CI
It currently takes ~1 hour to push a PR through CI. Roughly, this breaks down as:
| Action | Stage | Avg. Time |
|---|---|---|
| VM provisioning | metrics | 1-10 min |
| VM startup | metrics | 0.5 min |
| Container build (each) | metrics | 5-10 min |
| `docker-compose up` | metrics | 1-5 min |
| init job wait | metrics | 2-20 min |
| Smoke tests | metrics | 1 min |
| Tempest tests | metrics | 10 min |
| VM provisioning | logs | 1-10 min |
| VM startup | logs | 0.5 min |
| Container build (each) | logs | 5-10 min |
| `docker-compose up` | logs | 1-5 min |
| init job wait | logs | 2-20 min |
Guesstimates:
- Best case: 30 minutes
- Worst case: 1 hour 42 minutes (though probably more like 60 minutes)
Unfortunately, neither Travis nor GitHub seems to record the real-world time jobs actually take (including provisioning delays), so I had to guess a bit. But going through the process of merging a PR (approve, update to latest master, wait on CI, merge) takes at least an hour, sometimes more, since I actually have to remember to check on it later.
Some ideas to speed this up (definitely open to suggestions):
- Move the Tempest tests to a nightly cron job, automatically filing GitHub issues on failure (as mentioned in #200) - saves 10 minutes from the best case
  - smoke tests are (hopefully?) sufficient for per-patch validation
- Don't use Travis stages so we only incur VM provisioning time once (we can just run `ci.py` twice) - saves up to 10 minutes
- Run multiple container builds in parallel (pass `-w 2` to `dbuild`)
- Decrease the init job's max wait time from 20 minutes to something more reasonable (5?)
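A minimal sketch of the no-stages variant, assuming for illustration that `ci.py` takes a pipeline name as its argument (the real invocation may differ):

```yaml
# Hypothetical single-job .travis.yml: VM provisioning cost is paid once,
# and both pipelines run sequentially on the same VM.
sudo: required
services:
  - docker
script:
  - ./ci.py metrics   # assumed invocation; adjust to ci.py's actual CLI
  - ./ci.py logs
```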
With these changes the new rough times would be:
- Best case: 18.5 min
- Worst case: 51.5 min
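For what it's worth, the guesstimates above can be reproduced from the table, assuming one container build per stage and taking each range's extremes:

```python
# Per-stage timings from the table, in minutes, as (name, best, worst).
# Both the metrics and logs stages share these steps.
SHARED = [
    ("VM provisioning", 1, 10),
    ("VM startup", 0.5, 0.5),
    ("Container build", 5, 10),    # assuming one container build per stage
    ("docker-compose up", 1, 5),
    ("init job wait", 2, 20),
]
METRICS_ONLY = [("Smoke tests", 1, 1), ("Tempest tests", 10, 10)]

best = sum(b for _, b, _ in SHARED) * 2 + sum(b for _, b, _ in METRICS_ONLY)
worst = sum(w for _, _, w in SHARED) * 2 + sum(w for _, _, w in METRICS_ONLY)
print(best, worst)  # 30.0 102.0 -> 30 min best, 1 h 42 min worst

# With the proposed changes: drop Tempest (-10), provision one VM instead of
# two (-1 best / -10 worst, plus -0.5 startup), cap init wait at 5 min
# (worst case: -15 per stage).
new_best = best - 10 - 1 - 0.5
new_worst = worst - 10 - 10 - 0.5 - 15 * 2
print(new_best, new_worst)  # 18.5 51.5
```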
Hmm... I didn't catch that issue, and I have since submitted #227. Perhaps it will help, though.
There's one thing I picked up during my own time spent with `ci.py`. Overall, I'd stay with stages, but I'd try to move to a solution that builds all dirty modules prior to any tests (i.e. in the first stage, or one of the first stages).
The benefit is that we could make the entire CI much more reliable and maintainable. It is starting to grow a bit too large to understand properly and extend. The drawback is that stages are independent, so we would have to push the temporary images to the hub. However, we could also drop them afterwards.
Just to give you a hint, my idea was to have:
- a stage for linting, pretty much @matrixik's activities
- a stage for all the modules that need to be built, regardless of the pipeline
- all the required tests
- finally, publishing the modules (if the PR is merged, or other conditions are met).
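Sketching that layout as a `.travis.yml` (stage names and commands are invented; the real `ci.py` entry points would differ):

```yaml
# Hypothetical staged pipeline for the layout above.
jobs:
  include:
    - stage: lint
      script: ./ci.py lint          # assumed subcommand
    - stage: build
      script: ./ci.py build         # build all dirty modules up front
    - stage: test
      script: ./ci.py test
    - stage: publish
      if: branch = master AND type = push   # only after merge, for example
      script: ./ci.py publish
```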
All that is driven by the fact that recent changes have revealed we've introduced a lot of activities in CI that increased the execution time. My idea might not necessarily result in speeding things up, but I do strongly believe that a simple, or rather modular, setup is something monasca-docker should move toward. Plus, we would gain a little more in the area of logging, as the logs for the stages wouldn't be mixed up together. That is somewhat addressed in #225, but my idea does not necessarily conflict with it; I'd say they complement each other.
Update: there's a possibility to save & load images (`docker save` and `docker load`), so we might do that to save the results of the image build from the possible step 2 for later steps.
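For instance (paths and image names are made up, and this assumes the tarball can actually be handed between stages, e.g. via Travis's cache):

```yaml
# Persist built images at the end of the build-stage job...
before_cache:
  - docker save -o cached/images.tar monasca/api:ci monasca/agent:ci
# ...and restore them at the start of a later stage, if present:
before_install:
  - '[ -f cached/images.tar ] && docker load -i cached/images.tar || true'
cache:
  directories:
    - cached
```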
There's also the Docker integration for GitHub, but I guess that works as if your repo contained a single project that is built into one image. Since this repo contains Dockerfiles for multiple images, I don't think it will work out.
I agree that Travis's stages could make things a lot simpler, but unfortunately they can be very slow. Even during quiet periods it takes 3-5 minutes for our code to actually begin executing in each stage, and when Travis is under heavier load (like yesterday morning) we were waiting more like 10 minutes for a job to start. And that's per stage, so on busy days with 2 stages we can potentially sit around waiting for 20 minutes just to run 15 minutes of tests.
Given human delays in responding to CI results, the turnaround time per patch ends up being about an hour. Practically speaking, this means we can review and merge at most 8 patches in a given work day, even if they're trivial 1-line patches. In practice it's probably more like 5-6, though.
(The startup time is mostly from us needing sudo-capable VMs to use Docker; Travis's container-based builds, like we used in monasca-helm, are much quicker, usually 30-60 seconds.)
Ideally I'd like to see basic CI complete in 15 minutes (if not less) for the majority of jobs. And really, we're very close to that already, at least in terms of the code we actually run ourselves.
A more modular setup is a great idea, and I think we could implement that in a way that doesn't incur overhead for each build "stage". Actually running on a fresh VM (or even container) for each stage doesn't seem beneficial when we can just run tasks sequentially ourselves and incur no extra startup overhead. A side benefit to this is that we don't need to worry about transporting any built images between stages, since they're already present on the local system after being built.
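As a rough sketch of that single-job flow (the step names are invented for illustration; the real `ci.py` entry points would be used instead):

```python
import subprocess

# Hypothetical sequential CI driver: one VM, ordered steps, fail fast.
# Images built in the "build" step stay in the local Docker cache for the
# later test steps, so nothing needs to be transported between "stages".
STEPS = [
    ["./ci.py", "lint"],    # assumed subcommands, for illustration only
    ["./ci.py", "build"],
    ["./ci.py", "test"],
]

def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order; stop at the first failure, like a failing stage."""
    for cmd in steps:
        if runner(cmd).returncode != 0:
            return False
    return True
```

Injecting `runner` keeps the driver testable without actually invoking `ci.py` or Docker.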