[WIP] Download IDAKLU from pybammsolvers

Open kratman opened this issue 1 year ago • 7 comments

Description

This will separate the IDAKLU C++ code from pybamm.

Type of change

This should speed up CI by skipping the build of the C++ code.

  • [x] Optimization (back-end change that speeds up the code)

Key checklist:

  • [x] No style issues: $ pre-commit run (or $ nox -s pre-commit) (see CONTRIBUTING.md for how to set this up to run automatically when committing locally, in just two lines of code)
  • [x] All tests pass: $ python run-tests.py --all (or $ nox -s tests)
  • [x] The documentation builds: $ python run-tests.py --doctest (or $ nox -s doctests)

You can run integration tests, unit tests, and doctests together at once, using $ python run-tests.py --quick (or $ nox -s quick).

Further checks:

  • [x] Code is commented, particularly in hard-to-understand areas
  • [x] Tests added that prove fix is effective or that feature works

kratman avatar Oct 03 '24 16:10 kratman

A new link error cropped up, but it looks like this update could save a lot of CI time.

Edit: Most of the run time appears to be in the integration tests, so unfortunately the time savings are not as good as I would have hoped.

kratman avatar Oct 03 '24 17:10 kratman

The linkage error is the same one as #3783, coming from CasADi's plugin system. I am not sure if it's worth fixing: @martinjrobins fixed it for the linear interpolant case by dropping down to Python, but IIRC there wasn't a way to do the same in CasADi for the cubic case.

agriyakhetarpal avatar Oct 03 '24 18:10 agriyakhetarpal

@agriyakhetarpal Yeah, I was looking at that issue as well. As far as I can tell, CasADi sets a path for plugins. I am trying to see if there is a decent workaround, since this was part of #4464.

My guess is that the wheels for the next release will be broken as well, but I have not confirmed it yet.

kratman avatar Oct 03 '24 20:10 kratman

There is a workaround for Linux and macOS, but not for Windows (different toolchain); sadly, it's not decent enough to include. I think I'll raise a PR upstream in CasADi to get one part of the linkage going and see if we can migrate to a non-MSVC toolchain (which can potentially help provide that workaround for this on Windows later on). It's been on my list of things to do for a while, but I've yet to do it.

agriyakhetarpal avatar Oct 03 '24 20:10 agriyakhetarpal

This is fixed locally with this: export CASADIPATH=.venv/lib/python3.12/site-packages/casadi
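
The same local workaround can be sketched in Python without hard-coding the virtual environment path. This is a minimal sketch, not something taken from the PR: the helper names `package_dir` and `set_casadipath` are hypothetical, and it assumes that setting `CASADIPATH` in the process environment before CasADi is imported has the same effect as the shell `export` above.

```python
import importlib.util
import os
from pathlib import Path


def package_dir(package_name):
    """Return the directory an importable package is installed in, or None."""
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


def set_casadipath(environ=os.environ, package_name="casadi"):
    """Point CASADIPATH at the installed package directory, if it is found."""
    directory = package_dir(package_name)
    if directory is not None:
        environ["CASADIPATH"] = str(directory)
    return directory


if __name__ == "__main__":
    # Prints the directory that was used, or None if casadi is not installed.
    print(set_casadipath())
```

The thread does not establish when CasADi reads the variable, so calling the helper before `import casadi` is the cautious order.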

kratman avatar Oct 03 '24 22:10 kratman

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.66%. Comparing base (a7253b8) to head (c9a75e2). Report is 135 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4487      +/-   ##
===========================================
- Coverage    99.22%   98.66%   -0.56%     
===========================================
  Files          303      303              
  Lines        23070    23224     +154     
===========================================
+ Hits         22891    22914      +23     
- Misses         179      310     +131     


codecov[bot] avatar Oct 03 '24 22:10 codecov[bot]

> This is fixed locally with this: export CASADIPATH=.venv/lib/python3.12/site-packages/casadi

Yes, won't work with Windows

agriyakhetarpal avatar Oct 03 '24 23:10 agriyakhetarpal

I had a look at this. The linker error is the same one I came across for the case of linear interpolation. The solution there was to swap to using the direct CasADi function rather than their plugin system; the plugin system won't work when the CasADi function is evaluated in C++ on Windows, because we compile everything statically.

I think I might be able to access the direct bspline interface by calculating the spline coefficients in scipy and then using the casadi Function.bspline function to construct a bspline. Fingers crossed this doesn't use the plugin system anywhere! Going to try this out in https://github.com/pybamm-team/PyBaMM/issues/4570
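
A minimal sketch of that idea, under stated assumptions: it is not the code that went into PyBaMM, the helper name `casadi_bspline_from_samples` is hypothetical, and it does not settle whether `Function.bspline` avoids the plugin system. It fits a one-dimensional spline with SciPy's `make_interp_spline`, then passes the resulting knots, coefficients, and degree to CasADi.

```python
def casadi_bspline_from_samples(x, y, degree=3, name="interp"):
    """Fit a 1-D spline with SciPy and rebuild it as a CasADi Function."""
    # Imported here so that importing this module does not itself require
    # SciPy or CasADi to be installed.
    import casadi
    from scipy.interpolate import make_interp_spline

    spline = make_interp_spline(x, y, k=degree)
    knots = [float(t) for t in spline.t]    # knot vector, length n + degree + 1
    coeffs = [float(c) for c in spline.c]   # n B-spline coefficients
    # CasADi takes one knot vector and one degree per input dimension.
    return casadi.Function.bspline(name, [knots], coeffs, [degree])


if __name__ == "__main__":
    xs = [i / 5 for i in range(6)]
    f = casadi_bspline_from_samples(xs, [v**3 for v in xs])
    # Compare the CasADi evaluation with the sampled function at one point.
    print(float(f(0.37)), 0.37**3)
```

Sampling a cubic makes a convenient check, because a cubic interpolating spline with SciPy's default boundary conditions reproduces a cubic polynomial up to floating-point error.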

martinjrobins avatar Nov 06 '24 16:11 martinjrobins

> I had a look at this. The linker error is the same as I came across for the case of linear interpolation. The solution there was to swap to using the direct casadi function rather than their plugin system, which won't work if the casadi function is evaluated in C++ for windows as we compile everything statically.
>
> I think I might be able to access the direct bspline interface by calculating the spline coefficients in scipy and then use the casadi Function.bspline function to construct a bspline. Cross fingers this doesn't use the plugin system anywhere!

Yeah, I was going to approach this by seeing if I could just change the build itself. It should work if we are compiling and delivering everything correctly. If that does not work, then I will look at workarounds for interpolation.

kratman avatar Nov 06 '24 16:11 kratman

I expect to work on this again next week; I have been caught up with other stuff.

kratman avatar Nov 06 '24 16:11 kratman

Looks like there are still issues with the IDAKLU JAX solver on Windows. Shall I look into these?

martinjrobins avatar Nov 21 '24 10:11 martinjrobins

@martinjrobins Sure, if you want to look at it you are more than welcome. I am hopefully going to be able to take another look this evening.

I recently got a Windows laptop so I could start looking into this stuff locally. Most of my recent commits to this branch have been me testing things for the release, as I have been focused on getting that out the door.

kratman avatar Nov 21 '24 14:11 kratman

I tried to figure this one out today but no luck :( It's crashing with a fatal exception when jax tries to jit compile, and I'm still in the dark as to why. It might be a threading issue, as the problem is intermittent (it occurs in about 95% of test runs). It might be triggered by some interaction with pytest, because when I copy the test into a stand-alone script it works fine.

martinjrobins avatar Nov 21 '24 14:11 martinjrobins

For a stopgap solution, we can isolate these tests into their own xdist_group and allow only one worker to touch them at a time.
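
A minimal sketch of what that grouping could look like with pytest-xdist. The group name and the placeholder test bodies are hypothetical; only the `xdist_group` mark itself comes from pytest-xdist.

```python
import pytest

# Hypothetical group name; any string works as long as the tests share it.
IDAKLU_JAX_GROUP = "idaklu_jax"


@pytest.mark.xdist_group(name=IDAKLU_JAX_GROUP)
def test_jaxify_placeholder():
    # Placeholder body standing in for a real IDAKLU-JAX test.
    assert True


@pytest.mark.xdist_group(name=IDAKLU_JAX_GROUP)
def test_jax_grad_placeholder():
    # Shares the group above, so it is scheduled on the same worker.
    assert True
```

The mark only takes effect when the run uses `--dist loadgroup`, for example `pytest -n auto --dist loadgroup`. Tests in the same group are then sent to a single worker, while the rest of the suite can still run on the other workers.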

agriyakhetarpal avatar Nov 21 '24 17:11 agriyakhetarpal

> For a stopgap solution, we can isolate these tests into their own xdist_group and allow only one worker to touch them at a time.

Yeah that is my fallback option.

I want to take a closer look at the linking/delivery as well. We have failures on Windows when you download the wheels:

  • pybammsolvers has some crashed workers
  • My i5 (without AVX-512 instructions) has ~45 test failures on both 24.9.0 and 24.11.0 when running tests with the wheels
  • A colleague's i7 (with AVX-512 instructions) has ~25 test failures on both 24.9.0 and 24.11.0 when running tests with the wheels

So it appears that the tests work when you run them in the build environment, but not in a different environment. I will dig into this more and see what I come up with.

kratman avatar Nov 21 '24 18:11 kratman

Hi - I took a very quick look at this yesterday and agree that it seems to be a threading issue. More specifically, jaxify() can only be called once per solver instance (this is the first test), which then caches the full solve result so the jax-wrapper can query samples without repeatedly re-running the solver. My suspicion is that running these tests in parallel is causing test pollution, probably because the test script currently instantiates the solver and jax wrapper objects at the start of the test script, not as a fixture for each test (although in that case the Ubuntu tests should also be failing?).

Refactoring the tests with fixtures would be a good way to see if that resolves things. I can take a look at that if you like, but if I'm right then the xdist_group solution should also work if you need a quick fix. I can't remember the precise details as to why we can't jaxify more than once per object, but I do remember that it was more complex than just the cache issue (something to do with the jax primitives...).
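
The fixture refactor could take roughly the following shape. This is a sketch under assumptions: `OneShotSolver` is a hypothetical stand-in that only mimics the once-per-instance restriction on jaxify() described here, and it is not the IDAKLU solver or its real API.

```python
import pytest


class OneShotSolver:
    """Hypothetical stand-in for a solver whose jaxify() may run only once."""

    def __init__(self):
        self._jaxified = False

    def jaxify(self):
        if self._jaxified:
            raise RuntimeError("jaxify() was already called on this instance")
        self._jaxified = True
        # Placeholder for the JAX-wrapped solve.
        return lambda t: 2.0 * t


@pytest.fixture
def jax_wrapper():
    # Function scope is the default, so each test gets a fresh solver
    # and a fresh wrapper instead of sharing module-level objects.
    solver = OneShotSolver()
    return solver.jaxify()


def test_evaluates(jax_wrapper):
    assert jax_wrapper(1.5) == 3.0


def test_is_independent_of_other_tests(jax_wrapper):
    assert jax_wrapper(0.0) == 0.0
```

With module-level objects, every test in the file shares one solver and one wrapper. With the fixture, no state is carried between tests, at the cost of constructing the solver once per test.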

jsbrittain avatar Nov 22 '24 09:11 jsbrittain

Just a small note to say that it is not just a matter of running the tests in serial to make them pass. I had to turn off both the pytest workers and the faulthandler (pytest -n 0 -p no:faulthandler) before the tests would pass. With these options, all the tests in test_idaklu_jax.py pass reliably.

martinjrobins avatar Nov 22 '24 13:11 martinjrobins

Testing out the skips and test refactor now with CI. I have some docs stuff to update then this should be mostly ready to go. I will do one more update to pybammsolvers to make sure that the versions of IDAKLU source files match

kratman avatar Dec 26 '24 19:12 kratman

Tests pass, just need to do some documentation fixes

kratman avatar Dec 26 '24 21:12 kratman

Additional documentation will be added to the pybammsolvers repo

kratman avatar Dec 27 '24 18:12 kratman

@MarcBerliner, @martinjrobins OK, I think this is finally working. I am working on tests and docs for the other repo now.

kratman avatar Dec 27 '24 19:12 kratman

@agriyakhetarpal I know we have not solved the ARM64 or conda-forge issues yet, but how do you feel about getting this merged ASAP to see if we start getting issues reported?

kratman avatar Jan 07 '25 17:01 kratman

Note: The changes from #4736 are not in pybammsolvers yet. The tests pass without the C++ side for now, though. I am working on the pybammsolvers v0.0.5 release, but I have to do a bit of testing before it is ready. Hopefully I will finish that off today.

kratman avatar Jan 07 '25 18:01 kratman