Improve CI efficiency

Open j-stephan opened this issue 3 years ago • 14 comments

Discussed offline with @bernhardmgruber: Running the CI again every time a PR is merged is very inefficient. We have just tested the code base with the PR applied, so why test it again? This unnecessarily blocks the CI for a few hours.

Instead it might be more useful to run the develop CI on a regular basis. Maybe once per week or once every other day. Thoughts?

j-stephan avatar Dec 07 '21 15:12 j-stephan

The CI runs on the dev branch after a merge because a PR can be based on an old state of the development branch. If you merge two PRs in a row, we typically rebase them onto the latest development branch via GitHub. This can introduce issues even if there is no merge conflict. If that happens, you know which PR is the root of the issue. If you test the code only once per week, it is hard to say which PR broke the development branch.

psychocoderHPC avatar Dec 07 '21 15:12 psychocoderHPC

To reduce the load, I suggest testing fewer combinations per compiler for PRs and for merges into development, and performing a full matrix test once per week. This should still catch most issues when the PR is opened.

psychocoderHPC avatar Dec 07 '21 15:12 psychocoderHPC

Another way to reduce the CI load is to stage the tests: first test a handful of combinations and run more tests only if those succeed. For example, this is what we do for PIConGPU: https://gitlab.com/hzdr/crp/picongpu/-/pipelines/424201783
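A minimal sketch of the staging idea, not the PIConGPU or alpaka pipeline: the job names and the `run_job` callback are hypothetical, and a real CI system would express stages in its own configuration format.

```python
from typing import Callable, List, Sequence


def run_staged(stages: Sequence[Sequence[str]],
               run_job: Callable[[str], bool]) -> List[str]:
    """Run stages in order and stop after the first stage with a failure.

    Returns the names of the jobs that were actually executed.
    """
    executed: List[str] = []
    for stage in stages:
        results = []
        for job in stage:
            executed.append(job)
            results.append(run_job(job))
        # Later, more expensive stages only start if this stage is green.
        if not all(results):
            break
    return executed
```

With a small first stage, a failing job there means the larger second stage never starts, which is where the saved CI time comes from.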

psychocoderHPC avatar Dec 07 '21 15:12 psychocoderHPC

Also a solution to #1439 might help.

j-stephan avatar Dec 07 '21 15:12 j-stephan

For the record: PIConGPU uses a CI job generator to avoid testing a combination twice: https://github.com/ComputationalRadiationPhysics/picongpu/blob/dev/share/ci/n_wise_generator.py The generator is spaghetti Python, but it only emits valid combinations, e.g. host compiler + CUDA, .... With a generator it is easy to generate different complex matrices, e.g. if we do not want to run a full matrix per PR.
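To illustrate what such a generator does, here is a small greedy pairwise sketch. It is not the PIConGPU `n_wise_generator.py`; the parameter names, the values, and the validity rule in `valid` are assumptions made up for the example.

```python
from itertools import combinations, product


def valid(job: dict) -> bool:
    # Hypothetical rule: only gcc and clang may act as CUDA host compilers.
    return not (job["backend"] == "cuda" and job["compiler"] not in ("gcc", "clang"))


def pairs_of(job: dict, names: list) -> set:
    """All (parameter, value) pairs that a single job covers."""
    return {((a, job[a]), (b, job[b])) for a, b in combinations(names, 2)}


def pairwise_jobs(params: dict, is_valid) -> list:
    """Greedily pick valid jobs until every valid value pair is covered once."""
    names = sorted(params)
    candidates = [dict(zip(names, values))
                  for values in product(*(params[n] for n in names))]
    candidates = [job for job in candidates if is_valid(job)]
    uncovered = set().union(*(pairs_of(job, names) for job in candidates))
    selected = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda job: len(pairs_of(job, names) & uncovered))
        selected.append(best)
        uncovered -= pairs_of(best, names)
    return selected
```

The greedy choice does not guarantee the smallest possible job list, but it never emits an invalid combination and it covers every pair that some valid job can cover.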

psychocoderHPC avatar Dec 07 '21 15:12 psychocoderHPC

To reduce the CI job runtime, we should maybe run the header include check (headerCheck) in a separate job and only once for each compiler, e.g. gcc, clang, hipcc, nvcc, intel. There is a very small possibility that we miss something if the code contains a workaround that is only activated for a specific compiler version. IMO this risk is tiny but the saved runtime is gigantic. We could run scheduled tests once per week/month where the header check runs within each test. If we do miss an include, the fix is very easy and can be done without deep knowledge of the PR that introduced the missing include.
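The "once per compiler" selection can be sketched as follows. The job dictionaries and the `header_check` key are hypothetical and do not correspond to alpaka's actual CI variables.

```python
def assign_header_check(jobs: list) -> list:
    """Enable the header check only on the first job of each compiler family."""
    seen = set()
    result = []
    for job in jobs:
        family = job["compiler"]
        # Only the first job per family carries the expensive header check.
        result.append({**job, "header_check": family not in seen})
        seen.add(family)
    return result
```

A scheduled weekly or monthly run could bypass this function and enable the check on every job.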

psychocoderHPC avatar Dec 15 '21 10:12 psychocoderHPC

The header checks are already only run as part of the analysis CI runs. All other CI runs should not run header checks.

bernhardmgruber avatar Dec 15 '21 10:12 bernhardmgruber

> The header checks are already only run as part of the analysis CI runs. All other CI runs should not run header checks.

That is good to know. I was not aware that we do not always perform it.

[update] On the GitLab CI we always perform the header check.

psychocoderHPC avatar Dec 15 '21 10:12 psychocoderHPC

I realized that checking out boost, even with --depth 1, takes ages. We should check whether downloading the tarball from GitHub is faster.

psychocoderHPC avatar Dec 15 '21 14:12 psychocoderHPC

Or we just download the system boost in most runs, which is even faster.

bernhardmgruber avatar Dec 15 '21 14:12 bernhardmgruber

> Or we just download the system boost in most runs, which is even faster.

I also thought about it but I think it is harder to get FIBER running and I am not sure if the system boost is providing all versions. I have the patch for downloading the tar now ready and tested it within a docker container.

using `git`: 4 minutes
using wget and un-tar: ~2 sec
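For illustration, the download-and-unpack approach looks roughly like this in Python. This is a sketch and not the patch mentioned above: the function name is hypothetical, and the archive URL is left as a parameter because the actual download location is not given here.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def fetch_and_unpack(url: str, dest: Path) -> Path:
    """Download a tar archive from `url` and extract it into `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "archive.tar.gz"
        # Stream the download to disk instead of holding it in memory.
        with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
            shutil.copyfileobj(response, out)
        # tarfile detects the compression automatically; only extract
        # archives from a source you trust.
        with tarfile.open(archive) as tar:
            tar.extractall(dest)
    return dest
```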

I am waiting for https://github.com/alpaka-group/alpaka/pull/1537; then we can apply your open optimization and the wget PR.

psychocoderHPC avatar Dec 15 '21 15:12 psychocoderHPC

> I am not sure if the system boost is providing all versions.

It doesn't. It just provides one version. But I guess this is good enough for many of the CI builds. We can use an alternative boost on only a few CI jobs.

> I have the patch for downloading the tar now ready and tested it within a docker container.
>
> using `git`: 4 minutes
> using wget and un-tar: ~2 sec

Ok that is amazing! PR super welcome!

bernhardmgruber avatar Dec 15 '21 15:12 bernhardmgruber

> It doesn't. It just provides one version. But I guess this is good enough for many of the CI builds. We can use an alternative boost on only a few CI jobs.

IMO one version is not enough. Boost is one of the biggest issues we had and testing boost with clang-cuda/nvcc in various combinations is important. The problem is that boost itself is not testing nvcc and the usage within kernels.

psychocoderHPC avatar Dec 15 '21 15:12 psychocoderHPC

After the CI has seen a few revisions in the past weeks I'm fairly content with the current state of things. If nobody objects (@bernhardmgruber, @psychocoderHPC) I believe we can close this.

j-stephan avatar Mar 29 '22 09:03 j-stephan

There should be a way of not running the CI if no code (or CI code) is touched. Example: #1860 only updates README.md but technically we have to wait for the whole CI to finish.
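One possible shape for such a check, as a sketch only: the skip patterns below are assumptions, and many CI systems offer built-in path filters that would replace this logic.

```python
from fnmatch import fnmatch

# Hypothetical list of paths that never affect the build or the tests.
DOCS_ONLY_PATTERNS = ("*.md", "*.rst", "docs/*", "LICENSE")


def needs_full_ci(changed_files, skip_patterns=DOCS_ONLY_PATTERNS) -> bool:
    """Return True if at least one changed file is not covered by a skip pattern."""
    return any(
        not any(fnmatch(path, pattern) for pattern in skip_patterns)
        for path in changed_files
    )
```

A PR that only touches README.md would return False here, while any change to source files or to the CI configuration itself returns True.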

j-stephan avatar Dec 09 '22 11:12 j-stephan

I would close the issue and open a new one with that feature. Otherwise we will end up stuffing too much into this one issue.

bernhardmgruber avatar Dec 09 '22 12:12 bernhardmgruber