
Fix CI

landreman opened this issue 1 year ago • 13 comments

The CI appears to be broken. In the extensive tests with python 3.10 and 3.11, but not 3.9, there is a failure early on when pyoculus is installed.

This PR doesn't yet have a fix, just documenting that the problem occurs even for the master branch.

landreman avatar Jan 08 '25 20:01 landreman

The error appears to be related to f2py:

    File "/tmp/pip-build-env-78kkn94r/overlay/lib/python3.11/site-packages/numpy/f2py/capi_maps.py", line 377, in getpydocsign
      sig = '%s : %s %s%s' % (a, opt, c2py_map[ctype], init)
                                      ~~~~~~~~^^^^^^^
  KeyError: 'void'

I seem to remember this error appearing in the simsopt CI also last summer when numpy v2 was released, but I'm not certain. @smiet @mbkumar any ideas how to fix this?

landreman avatar Jan 08 '25 20:01 landreman

Looks like a numpy>=2.0 issue with pyoculus. I solved a similar issue for VMEC; perhaps the fix could be replicated here. I'll do a more thorough check and get back to you.

mbkumar avatar Jan 08 '25 20:01 mbkumar

In VMEC, we moved away from numpy.distutils, which is deprecated and no longer working properly. Pyoculus is still using numpy.distutils. This should be updated, and a more robust build backend such as scikit-build should be used. @smiet used a meson build system for SPEC; perhaps he would be interested in using the same one for PyOculus. Otherwise, I can take the lead and update the build system to a new one based on scikit-build and cmake. A rough sketch of what that could look like follows.
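
For concreteness, a minimal sketch of a scikit-build-core based pyproject.toml (all metadata below is an illustrative placeholder, not pyoculus's actual configuration; it assumes the compiled extension is driven by a top-level CMakeLists.txt):

    [build-system]
    # scikit-build-core replaces the deprecated numpy.distutils and builds
    # the compiled extension through CMake.
    requires = ["scikit-build-core", "numpy"]
    build-backend = "scikit_build_core.build"

    [project]
    name = "pyoculus"
    version = "0.0.0"  # placeholder
    dependencies = ["numpy"]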

mbkumar avatar Jan 08 '25 21:01 mbkumar

There is a branch with meson-build, but it still has an issue (see the corresponding issue on pyoculus).

The issue seems to originate from a change in numpy 2.1 or 2.2. I have not reproduced it myself, but we might need to clamp 2.0 <= numpy <= 2.1.

I have currently clamped the numpy version used for the pyoculus build to numpy 2.0; this should fix it temporarily, hopefully until the new pyoculus version I am working on is ready for prime time. A sketch of such a pin is below.
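
One way to express such a clamp, assuming a pyproject.toml-based meson build (the actual change may instead pin numpy in the CI install step):

    [build-system]
    # Pin the build-time numpy to the 2.0.x series to avoid the f2py changes
    # in numpy 2.1/2.2 that trigger the KeyError: 'void' above.
    requires = ["meson-python", "numpy>=2.0,<2.1"]
    build-backend = "mesonpy"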

There is a meson build for pyoculus ready to go, but I'm not sure it works on all mac versions. Currently traveling; I will work on merging that into master when I get back next week, but right now I don't have the time to deal with the potential fallout such a change could bring.

smiet avatar Jan 09 '25 11:01 smiet

@landreman The CI is still broken after Chris's fixes for pyoculus. Two tests are failing. Need to figure out the source of the failures.

mbkumar avatar Jan 10 '25 12:01 mbkumar

Ugh, yes, why are all these tests failing in the CI now?

test_curve_first_derivative (geo.test_curve_optimizable.Testing)
test_Bn (geo.test_pm_grid.Testing)
test_boozer_surface_optimisation_convergence (geo.test_boozersurface.BoozerSurfaceTests)
test_convergence_cpp_and_notcpp_same (geo.test_boozersurface.BoozerSurfaceTests)
test_iotas_derivative (geo.test_surface_objectives.IotasTests)

These tests all passed when simsopt release v1.8.1 was made on Nov 22, and there have been no changes to master since then, so the failures must be due to a change in one or more of the libraries that we depend on. @andrewgiuliani @akaptano any ideas?

landreman avatar Jan 13 '25 15:01 landreman

For the extensive CI, two tests fail for python 3.9:

test_curve_first_derivative (geo.test_curve_optimizable.Testing)
test_Bn (geo.test_pm_grid.Testing)

Comparing the successful Nov 22 test log "Install simsopt package" around line 2857 to a recent log "Install simsopt package" around line 2074, the numpy version was the same, 1.26.4, and the version of scipy was the same, 1.13.1. So the numpy and scipy versions can't explain those test failures with python 3.9.

In python 3.10, where those other tests fail: comparing the successful Nov 22 test log "Install simsopt package" around line 2820 to a recent log "Install simsopt package" around line 2000, the numpy version increased from 1.26.4 to 2.2.1, and the scipy version increased from 1.14.1 to 1.15.0. So the newer numpy and scipy versions might explain the new test failures in

test_boozer_surface_optimisation_convergence (geo.test_boozersurface.BoozerSurfaceTests)
test_convergence_cpp_and_notcpp_same (geo.test_boozersurface.BoozerSurfaceTests)
test_iotas_derivative (geo.test_surface_objectives.IotasTests)

landreman avatar Jan 13 '25 17:01 landreman

Some updates: The test

test_iotas_derivative (geo.test_surface_objectives.IotasTests)

is now passing - I just slightly loosened a tolerance from 0.3 to 0.31. The test

test_Bn (geo.test_pm_grid.Testing)

is now passing, after loosening an atol slightly from 0 to 1e-15. That leaves 3 failing tests.
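
For illustration, the shape of that change (variable names here are hypothetical stand-ins, not the actual test code):

    import numpy as np

    # Before: atol=0 demanded bitwise-exact agreement, which is fragile when
    # compiler or library versions perturb the last bits of the result.
    # np.testing.assert_allclose(Bn_computed, Bn_target, atol=0)

    # After: allow an absolute tolerance at the level of double-precision
    # rounding error.
    Bn_computed = np.array([1.0e-16, -2.0e-16])  # stand-in values
    Bn_target = np.zeros(2)
    np.testing.assert_allclose(Bn_computed, Bn_target, atol=1e-15)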

The failing tests in

test_boozer_surface_optimisation_convergence (geo.test_boozersurface.BoozerSurfaceTests)
test_convergence_cpp_and_notcpp_same (geo.test_boozersurface.BoozerSurfaceTests)

for python 3.10 and 3.11 are still not understood. In the branch ml/20250113-debug_CI the Tests CI runs only these tests so we can debug them faster. Feel free to push to this branch for debugging.

The test failure

test_curve_first_derivative (geo.test_curve_optimizable.Testing)

for all python versions is also not understood - now obj.J() returns nan. In the 20250113-debug_CI branch, this test is run by itself, and it passes. How could the result of obj.J() depend on whether or not other tests are run?

landreman avatar Jan 15 '25 10:01 landreman

Given that a few tests are solved with fixes to the tolerances, this might be related to changes in data type promotion in numpy 2.x. Though numpy 1.26 is used during execution, the C-API is compiled using numpy 2.2. There have been changes to these parts, and the 2.2.1 update specifically mentions breakage in downstream projects due to typing changes.

In some cases this produces lower-precision data types, which could fail the tests. Though this is still strange, as the 1.26 runtime should keep the promotion the same.

see: https://numpy.org/neps/nep-0050-scalar-promotion.html#nep50

The numpy 2.0 migration guide does list a way to raise a warning when this happens: https://numpy.org/doc/2.2/numpy_2_0_migration_guide.html#changes-to-numpy-data-type-promotion

We could test np._set_promotion_state('weak_and_warn') (or 'legacy') in the failing tests to see if this is the cause.
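
As a sketch of that check (note that _set_promotion_state was a private helper during the NEP 50 transition and, as far as I know, has been removed in the newest numpy releases, so this needs a transition-era version):

    import numpy as np

    # Warn whenever the new "weak" promotion rules give a different dtype
    # than the legacy rules did; pass 'legacy' to restore the old behavior.
    np._set_promotion_state('weak_and_warn')

    # Canonical NEP 50 change: a python float combined with a float32 scalar
    # used to promote to float64; under weak promotion it stays float32, and
    # 'weak_and_warn' emits a warning here.
    res = np.float32(3) + 3.0
    print(res.dtype)  # float32 under the new rules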

Alternatively, fixing the build-time numpy version to 2.0 is potentially a short-term fix that I will try now.

smiet avatar Jan 15 '25 13:01 smiet

I believe the main issue is that the OS was updated to Ubuntu 24.04. When I specified Ubuntu 22.04 as the OS in a different branch, there was only one failure.

mbkumar avatar Jan 15 '25 13:01 mbkumar

I agree with @mbkumar that the ubuntu version must be part of the story. In this run of the ml/20250113-debug_CI branch, which directly compares ubuntu-22.04 vs ubuntu-latest, the test_boozer_surface_optimisation_convergence tests all pass in the former and fail in the latter. The default ubuntu version in the CI changed in December, so the timing is right. Some differences between the os versions are documented here. Looking at all the versions of python packages at the end of the "Install simsopt package" step, it looks like they are all the same between the older vs newer ubuntu runs, so that is not the issue. While we could have the CI use the older ubuntu version for now, it would be better to get the CI working with the latest ubuntu version.

In that CI run, I printed out the residual vs iteration for those failing tests, and it looks like Newton's method is diverging rather than converging. Example:

!!!!!! Testing surfacetype SurfaceXYZTensorFourier, stellsym True, optimize_G True, second_stage newton, get_data <function get_giuliani_data at 0x7f87650a5510>, vectorize False
Residual norm after LBFGS 682 2.294509402657461e-03
minimize_boozer_penalty_constraints_newton iter: 0  val: 2.6323866994417486e-06  norm: 0.0006022726438895579
minimize_boozer_penalty_constraints_newton iter: 1  val: 0.00039005450363802563  norm: 0.5041954646638603
minimize_boozer_penalty_constraints_newton iter: 2  val: 5.094615223534711e-06  norm: 0.007185679457999476
minimize_boozer_penalty_constraints_newton iter: 3  val: 4.089872578117214e-05  norm: 0.14172172054575086
minimize_boozer_penalty_constraints_newton iter: 4  val: 8.809182188159334e-06  norm: 0.01242660254456349
minimize_boozer_penalty_constraints_newton iter: 5  val: 0.00013128372620558918  norm: 0.2830093092565208
minimize_boozer_penalty_constraints_newton iter: 6  val: 1.6468981915618705e-05  norm: 0.013756818781153798
minimize_boozer_penalty_constraints_newton iter: 7  val: 10.157595247966423  norm: 194.07052961535098
minimize_boozer_penalty_constraints_newton iter: 8  val: 1.3450970606902033  norm: 50.677055615637116
minimize_boozer_penalty_constraints_newton iter: 9  val: 128.27102807457686  norm: 1449.0650971488906
minimize_boozer_penalty_constraints_newton iter: 10  val: 153.97224755727544  norm: 1855.6508088107519
minimize_boozer_penalty_constraints_newton iter: 11  val: 2779207607344.2334  norm: 84211407168.62497
minimize_boozer_penalty_constraints_newton iter: 12  val: 548978693020.2609  norm: 24951522381.755825
minimize_boozer_penalty_constraints_newton iter: 13  val: 108439976999.2  norm: 7393039889.858839
minimize_boozer_penalty_constraints_newton iter: 14  val: 21420127411.04032  norm: 2190527818.519551
minimize_boozer_penalty_constraints_newton iter: 15  val: 4231085186.618325  norm: 649043599.9812926
minimize_boozer_penalty_constraints_newton iter: 16  val: 835747207.6700873  norm: 192308094.79325154
minimize_boozer_penalty_constraints_newton iter: 17  val: 165075792.47637716  norm: 56979388.17807232
minimize_boozer_penalty_constraints_newton iter: 18  val: 32786359.931228556  norm: 16899121.641762033
minimize_boozer_penalty_constraints_newton iter: 19  val: 6599061.6332561895  norm: 5108009.228141096
minimize_boozer_penalty_constraints_newton iter: 20  val: 1373211.6883342883  norm: 1574328.522892679
minimize_boozer_penalty_constraints_newton norm: 1574328.522892679  tol: 1e-10  success: False

@andrewgiuliani @mbkumar can you think of any reason that Newton iterations would diverge for these tests due to a change in os/gcc version? Do you have any suggestions how to fix this?

The test geo.test_boozersurface.BoozerSurfaceTests.test_convergence_cpp_and_notcpp_same is fixed now by slightly loosening a tolerance.

For the other failing test, geo.test_curve_optimizable.Testing.test_curve_first_derivative: By printing info at curveobjectives.py line 43 in the CI, it looks like the problem is coming from self.curve.incremental_arclength(), which for RotatedCurve is returning NaNs. Any ideas why?
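
For anyone reproducing this, a sketch of the kind of guard used to localize the NaNs (the helper below is hypothetical debugging code, not part of simsopt):

    import numpy as np

    def check_incremental_arclength(curve):
        """Hypothetical debugging helper: flag NaNs from incremental_arclength
        before they propagate into obj.J()."""
        dl = np.asarray(curve.incremental_arclength())
        n_nan = int(np.isnan(dl).sum())
        if n_nan > 0:
            raise FloatingPointError(
                f"incremental_arclength returned {n_nan} NaN value(s) "
                f"for {type(curve).__name__}")
        return dl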

landreman avatar Jan 21 '25 14:01 landreman

Thanks Matt for the detailed investigation. I don't have answers now. I'll start debugging these issues tomorrow.

mbkumar avatar Jan 21 '25 16:01 mbkumar

@landreman I encountered an issue that seems very similar to this when trying to get my pull request (#475) to pass the tests earlier this week. I found that during the tests, calls to scipy.linalg.qr() would erroneously return matrices with nan elements in cases when the input matrix should have had a well-defined QR factorization. Similarly to your experience, this only happened when the tests were run in GitHub Actions; I could not reproduce this result on my machine.

I eventually found that if I simply called scipy.linalg.qr() a second time with the same input, it would return the correct matrices with no nan values. So as a quick and dirty fix, I wrapped scipy.linalg.qr() with a function that calls scipy.linalg.qr(), checks the outputs, and calls it a second time if it detects nan values in the outputs from the first attempt. This at least allowed it to pass the CI tests.
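
A minimal sketch of that wrapper (the retry count and the failure handling are my illustrative choices; the version on my branch may differ in details):

    import numpy as np
    from scipy import linalg

    def qr_with_retry(a, max_retries=1, **kwargs):
        """Call scipy.linalg.qr, retrying if any returned matrix contains NaNs."""
        for attempt in range(max_retries + 1):
            result = linalg.qr(a, **kwargs)
            # qr may return a single array or a tuple (e.g. Q, R) depending
            # on the mode; normalize to a tuple for the NaN check.
            outputs = result if isinstance(result, tuple) else (result,)
            if not any(np.isnan(np.asarray(m)).any() for m in outputs):
                return result
        # Still NaN after retrying: fail loudly rather than hide the problem.
        raise FloatingPointError(
            f"scipy.linalg.qr returned NaNs after {max_retries + 1} attempts")

In the behavior described above, the second call always succeeded, so the final exception is just a safety net.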

After some further debugging, the issue with scipy.linalg.qr() seemed to be triggered (quite inexplicably) if, somewhere earlier in the workflow, a magnetic field evaluation by a MagneticField object had been performed. Following a magnetic field evaluation, the first call to scipy.linalg.qr() produces nan results. However, simply calling scipy.linalg.qr() a second time seems to "fix" whatever problem was created by the field evaluation.

On this branch in my fork of Simsopt (https://github.com/kchammond/simsopt/tree/wireframe_github_debug), I've created a workflow that explores this issue a bit. See in particular the function test_4_min_working in tests/github_debug/test_wf_factorization.py. Interestingly, the problem only seems to come up when the MagneticField subclass actually performs a field evaluation; the problem is not induced in instances where the MagneticField class draws values from its cache.

I've seen some discussions of similar issues on the Scipy and Numpy repos (https://github.com/scipy/scipy/issues/5586 and https://github.com/numpy/numpy/issues/20356), all of which involve linear algebra functions in Scipy and Numpy unexpectedly outputting nan values.

If the errors you describe are anything like the one I dealt with, any chance simply calling the relevant functions a second time would "fix" things?

kchammond avatar Feb 28 '25 17:02 kchammond