alpaka
Improve CI efficiency
Discussed offline with @bernhardmgruber: The current way of running the CI every time a PR is merged is very inefficient. We just tested the code base against the PR, so why test it again? This unnecessarily blocks the CI for a few hours.
Instead, it might be more useful to run the develop CI on a regular basis, maybe once per week or once every other day. Thoughts?
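For a concrete picture of what such a scheduled run could look like, here is a hypothetical GitHub Actions fragment with a weekly cron trigger; the workflow name, cadence, and test entry point are placeholders, not alpaka's actual setup:

```yaml
# Hypothetical workflow fragment: run the full CI on develop on a schedule
# instead of after every merge. Names and timing are placeholders.
name: weekly-develop-ci
on:
  schedule:
    - cron: '0 3 * * 1'   # every Monday at 03:00 UTC
  workflow_dispatch: {}    # also allow triggering the run manually
jobs:
  full-matrix:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: develop
      - run: ./ci/run_full_matrix.sh   # placeholder for the real test entry point
```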
The CI runs for the dev branch after a merge because a PR can be based on an old development branch. If you merge two PRs in a row (we typically rebase these PRs via GitHub onto the latest development branch), this could introduce issues even if there is no merge conflict. If so, you know which PR is the root of the issue. It is hard to say which PR is breaking the development branch if you test the code only once per week.
To reduce the load, I suggest testing fewer combinations per compiler for PRs and merges into development, and performing a full matrix test once per week. This should catch most issues when the PR is opened.
Another way to reduce the CI load is by staging the tests. This means you first test a handful of combinations and only on success perform more tests. For example, this is what we do for PIConGPU: https://gitlab.com/hzdr/crp/picongpu/-/pipelines/424201783
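In GitLab CI terms, such staging can be sketched with pipeline stages, where a small smoke set gates the broad matrix. This is an illustration only; the job names and scripts are made up and not taken from the PIConGPU pipeline:

```yaml
# Hypothetical GitLab CI fragment: a small smoke stage gates the full matrix.
stages:
  - smoke
  - full

smoke-gcc11-cuda:
  stage: smoke
  script: ./ci/run_one_combination.sh gcc11 cuda   # placeholder script

full-matrix:
  stage: full
  needs: [smoke-gcc11-cuda]   # only runs if the smoke job succeeded
  script: ./ci/run_matrix.sh                       # placeholder script
```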
Also a solution to #1439 might help.
For the record: PIConGPU uses a CI job generator to avoid testing a combination twice. https://github.com/ComputationalRadiationPhysics/picongpu/blob/dev/share/ci/n_wise_generator.py The generator is spaghetti Python but only runs valid combinations, e.g. host compiler + CUDA, .... With a generator it is easy to generate different complex matrices, e.g. if we do not want to run a full matrix per PR.
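As an illustration only (this is not the actual `n_wise_generator.py`), filtering a combination matrix down to valid, deduplicated jobs could look roughly like this in Python; the compiler and device names are made up:

```python
from itertools import product

# Hypothetical sketch; the real generator (n_wise_generator.py) is more involved.
host_compilers = ["gcc-11", "clang-13"]
devices = ["cpu", "cuda-11.4"]

def is_valid(host, device):
    """Reject combinations that make no sense. Assumption for this sketch:
    clang is not used as a CUDA host compiler."""
    if device.startswith("cuda") and host.startswith("clang"):
        return False
    return True

def generate_jobs(hosts, devices):
    seen = set()
    jobs = []
    for host, device in product(hosts, devices):
        if not is_valid(host, device):
            continue
        key = (host, device)
        if key in seen:  # never emit the same combination twice
            continue
        seen.add(key)
        jobs.append({"host": host, "device": device})
    return jobs

print(generate_jobs(host_compilers, devices))
```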
To reduce the CI job runtime we should maybe run the header include check, called headerCheck, in a separate job and only once for each compiler, e.g. gcc, clang, hipcc, nvcc, intel. There is a very small possibility we miss something if we have a workaround within the code that is only activated for a specific compiler version. IMO this risk is very tiny but the saved runtime is gigantic.
We could run scheduled tests once per week/month where we run the header check within each test. In case we missed an include, the fix is very easy and can be done without deep knowledge of the PR that introduced the missing include.
The header checks are already only run as part of the analysis CI runs. All other CI runs should not run header checks.
> The header checks are already only run as part of the analysis CI runs. All other CI runs should not run header checks.
That is good to know, I was not aware that we do not always perform it.
[update] On the GitLab CI we always perform the header check.
I realized that checking out boost even with --depth 1 took ages; we should check if it is faster to download the tarball from GitHub.
Or we just download the system boost in most runs, which is even faster.
> Or we just download the system boost in most runs, which is even faster.
I also thought about it, but I think it is harder to get FIBER running and I am not sure if the system boost provides all versions. I have the patch for downloading the tarball ready now and tested it within a docker container.
- using `git`: 4 minutes
- using `wget` and un-tar: ~2 sec
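A sketch of the tarball approach, for illustration; the download URL below is an assumption, so it should be replaced with whatever mirror the actual patch uses:

```shell
#!/bin/sh
# Sketch: derive a Boost source tarball name from a version string
# instead of doing a git checkout. The version and URL are placeholders.
BOOST_VERSION="1.78.0"
BOOST_UNDERSCORED=$(printf '%s' "$BOOST_VERSION" | tr '.' '_')
TARBALL="boost_${BOOST_UNDERSCORED}.tar.gz"
echo "$TARBALL"
# wget -q "https://example.org/boost/${TARBALL}"  # replace with the real mirror
# tar -xzf "$TARBALL"
```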
I am waiting for https://github.com/alpaka-group/alpaka/pull/1537 and then we can apply your open optimization and the wget PR.
> I am not sure if the system boost is providing all versions.
It doesn't. It just provides one version. But I guess this is good enough for many of the CI builds. We can use an alternative boost on only a few CI jobs.
> I have the patch for downloading the tar now ready and tested it within a docker container.
> using `git`: 4 minutes, using wget and un-tar: ~2 sec
Ok that is amazing! PR super welcome!
> It doesn't. It just provides one version. But I guess this is good enough for many of the CI builds. We can use an alternative boost on only a few CI jobs.
IMO one version is not enough. Boost has been one of the biggest sources of issues we had, and testing boost with clang-cuda/nvcc in various combinations is important. The problem is that boost itself does not test nvcc or usage within kernels.
After the CI has seen a few revisions in the past weeks I'm fairly content with the current state of things. If nobody objects (@bernhardmgruber, @psychocoderHPC) I believe we can close this.
There should be a way of not running the CI if no code (or CI code) is touched. Example: #1860 only updates README.md, but technically we have to wait for the whole CI to finish.
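For pipelines triggered by GitHub Actions, this kind of filtering can be expressed with `paths-ignore` on the trigger; the ignored patterns below are just examples of documentation-only paths:

```yaml
# Hypothetical trigger fragment: skip CI for documentation-only changes.
on:
  pull_request:
    paths-ignore:
      - '**.md'
      - 'docs/**'
```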
I would close the issue and open a new one with that feature. Otherwise we will end up stuffing too much into this one issue.