Manifolds.jl icon indicating copy to clipboard operation
Manifolds.jl copied to clipboard

Speed up CI

Open sethaxen opened this issue 4 years ago • 21 comments

There are a few more things we can do to speed up CI and avoid timeouts:

  1. Split the tests into groups that are run in parallel. Choose an environment variable that if set can filter the tests to be run down to a group, defaulting to all. Then we set that variable in the build matrix in CI so the groups run in parallel.

  2. Switch to another CI provider e.g. GitHub Actions will let us run 20 Linux jobs in parallel, with 6 hour time limits. There's the org https://github.com/julia-actions, which has a number of tools to make GitHub Actions easy with Julia projects.

GitHub Actions is a little less user friendly for CI than Travis, and either way, it will benefit from test groups, so my thinking is to start with (1) above.

sethaxen avatar Apr 01 '20 20:04 sethaxen

I think this issue is finished, well maybe up to the point whether to deactivate AppVeyor and/or Travis?

kellertuer avatar Apr 09 '20 15:04 kellertuer

I think #150 is supposed to resolve this.

mateuszbaran avatar Apr 09 '20 15:04 mateuszbaran

GitHub Actions runs the test suite pretty fast, so I only saw about a 10 minute speed-up in CI with #150. But because the groups spawn so many jobs, it also fills up the 20-job queue pretty quickly, so I think it's no that useful yet, but probably will be later. I think we could deactivate Travis and Appveyor now and just revert that PR if we have problems with GH actions.

sethaxen avatar Apr 09 '20 18:04 sethaxen

We finished point 2 of this issue, but our tests run for about an hour currently. Would a parallelisation help? I am currently not sure how we would do a parallel test.

kellertuer avatar Oct 14 '21 13:10 kellertuer

I think there are no easy solutions here. One idea would be defining environment variables in CI which are read by runtests.jl and enable/disable selected tests. So some different CI jobs would do different tests.

mateuszbaran avatar Oct 14 '21 13:10 mateuszbaran

Main question is whether we want to keep this issue open and want to do this or whether we are fine for now with the current hour of testing

kellertuer avatar Oct 14 '21 13:10 kellertuer

I'm fine with one hour of testing but it's near the upper limit of what is acceptable. And there are some related issues, for example if we wanted to split some part of Manifolds.jl out to a different package this would reduce CI times as well.

mateuszbaran avatar Oct 14 '21 13:10 mateuszbaran

Yes at some point we should do that, maybe for the Lie groups of for the differentiation or such. So lets leave this here open to keep that in mind?

kellertuer avatar Oct 14 '21 13:10 kellertuer

Sure, I think we can keep this open.

mateuszbaran avatar Oct 14 '21 15:10 mateuszbaran

I noticed one thing – namely running Julia with julia -c0 --compile=min a lot of simple manifolds tumble down in time tremendously (for example circle from 32 -> 1.2, Euclidean 56->6, fixed rank 19->1.9,...).

Here is the exceptions (with local times from me, first time normal Julia, second the optimised, times in seconds)

  • hyperbolic: 32 - > 106
  • tucker: 18 -> 63
  • power`: 184 -> 164 (but long enough to maybe be added to my next idea below)
  • metric: 50 -> 89
  • statistics: 47 -> 1000+
  • general_linear: 53 -> 1511
  • special_linear: 64 -> 1699 I will the remaining ones, once this is finished.

So my proposal is to introduce a third group using these files above. Then run the current two ones with the faster compiler option and the third with normal ones. I just have to figure out

  1. How to set command line options in the runtests action
  2. How to modify them by matrix case.

What do you think?

kellertuer avatar Jul 21 '22 15:07 kellertuer

Might also be worth looking at some of the speed-ups used for the ChainRules test suite:

  • https://github.com/JuliaDiff/ChainRules.jl/pull/499
  • https://github.com/JuliaDiff/ChainRules.jl/pull/500

sethaxen avatar Jul 21 '22 15:07 sethaxen

...if only I would understand anything from those changes. I have absolutely no clue what both PRs are doing :(

Both seem to me like magic and do not provide information what is done.

kellertuer avatar Jul 21 '22 15:07 kellertuer

...if only I would understand anything from those changes. I have absolutely no clue what both PRs are doing :(

You're in good company: https://github.com/JuliaDiff/ChainRules.jl/pull/499#issuecomment-898254561

sethaxen avatar Jul 21 '22 15:07 sethaxen

Excellent, with Frames on my side – I am fine I think.

I saw one thing that might do the one thing I describe above. But the timings for some manifolds are also still running, I will report :)

kellertuer avatar Jul 21 '22 15:07 kellertuer

So I think what those PRs are primarily doing is avoiding compiling code and instead running it dynamically. This makes sense when the majority of time is spent compiling. I don't have a good sense if that's the case here, but I wouldn't be at all surprised if it was. JuliaInterpreter for example avoids compilation, and I think @nospecialize may do the same.

sethaxen avatar Jul 21 '22 15:07 sethaxen

It is the case here for all manifolds besides the ones I collect above :)

kellertuer avatar Jul 21 '22 15:07 kellertuer

I am not sure the compiler flags can be set amidst tests, but we will see for worst of cases the above PR just splits of a small part and the rest is not much faster, for best of cases a lot of the tests get a tremendous speedup.

Edit: So locally here the Base experimental way to set the compiler just does not work :/

kellertuer avatar Jul 21 '22 16:07 kellertuer

So at least if we get it to run this way here is some local timings from my machine:

Case I: Optimised compiler

  • Manifolds: 135 seconds
  • Lie groups 92 seconds
  • common part (non compiler optimised stuff): 595 seconds.

Case II Previous version

  • Manifolds: 1075 seconds
  • Lie groups: 831 seconds
  • common part (same time as above): 595 seconds

So one could also move power back from group 3 to 1, when all are running in Parallel we would get more like (300,92,430) in the first block when running in parallel that would be 7:10 minutes (compared to the 17:55 for manifolds in case 2).

In sum we are at 822 seconds (13:42) instead of 2501 seconds (41:41).

Ok this is all locally here, so the CI might be a factor of 2 or 2.5 slower, still, we would get a speedup of about 2-3 :)

kellertuer avatar Jul 21 '22 17:07 kellertuer

Cool, that would be a great speedup :+1: .

mateuszbaran avatar Jul 21 '22 19:07 mateuszbaran

While I started a large PR that I abandoned, I still think we can give this one last try. Here are a few ideas I want to try to work on

  • make the tests more modular, the idea would be to
    1. be able to test one function (e.g. exp) on all manifolds that exist
    2. be able to test a single manifold
    3. combine these with different type for the eltype of matrix type.
  • provide command line access to this tensor of options (similar to make.jl --quarto so that it is relatively easy to also run “reduced” tests locally.

I will first approach the last point. And then go step by step, sometimes even just one function on one manifold per PR; that way it should be something doable over some time. The PR would basically move tests into the new framework and remove them from the old – while both run and we aim to never lose coverage.

With the modularity one could also define a minimal set of manifolds to test as an integration test.

kellertuer avatar Aug 25 '23 09:08 kellertuer

I think doing this step by step is a good approach :+1:

mateuszbaran avatar Aug 28 '23 13:08 mateuszbaran