CI Integration Example: SpeedyWeather.jl
@vchuravy suggested I add an example using SpeedyWeather.jl to the integration tests.
The test is pretty much the same as what we do in the paper: a sensitivity analysis of a single grid point. It checks that this runs without errors and that the gradient makes physical sense, i.e. the gradient is localised around the selected grid point and small far away from it.
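As a rough illustration of that locality check (a toy sketch, not the actual SpeedyWeather test: it uses a hypothetical 1-D smoothing step and finite differences instead of Enzyme), the gradient of the output at the selected grid point should be O(1) near that point and essentially zero far away:

```julia
# Toy stand-in for one model step: nearest-neighbour smoothing on a 1-D grid.
# (Hypothetical example; the real test differentiates SpeedyWeather with Enzyme.)
step(x) = [0.25 * x[max(i - 1, 1)] + 0.5 * x[i] + 0.25 * x[min(i + 1, end)] for i in eachindex(x)]

n = 64
i0 = 32             # the selected grid point
x = zeros(n)

# gradient of step(x)[i0] w.r.t. each x[j], via finite differences
g = map(eachindex(x)) do j
    h = 1e-6
    xp = copy(x)
    xp[j] += h
    (step(xp)[i0] - step(x)[i0]) / h
end

near = maximum(abs.(g[i0-1:i0+1]))  # large: the point and its neighbours
far  = maximum(abs.(g[1:i0-10]))    # ~0: points far from i0
```

Here `near` comes out around 0.5 and `far` is numerically zero, which is the "localised gradient" sanity check in miniature.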
Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (`git runic main`) to apply these changes.
```diff
diff --git a/test/integration/SpeedyWeather/runtests.jl b/test/integration/SpeedyWeather/runtests.jl
index ea3e354c..c606e9fd 100644
--- a/test/integration/SpeedyWeather/runtests.jl
+++ b/test/integration/SpeedyWeather/runtests.jl
@@ -6,7 +6,7 @@
 using SpeedyWeather, Enzyme, Test
 spectral_grid = SpectralGrid(trunc = 32, nlayers = 8) # define resolution
-model = PrimitiveWetModel(; spectral_grid, physics=false) # construct model
+model = PrimitiveWetModel(; spectral_grid, physics = false) # construct model
 # physics = false to accelerate the test
 simulation = initialize!(model)
 initialize!(simulation)
```
Thanks Max!
@giordano can we make the filtering more aggressive and only run CI for the integration test that was changed?
We could generate the matrix dynamically, but only when *exclusively* files under test/integration/ have been modified; otherwise I think we want to run all the tests, right?
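The decision logic could be sketched like this (a hypothetical helper, assuming suite names are the directory names directly under test/integration/ and that the changed-file list comes from something like `git diff --name-only`; the real matrix generation would live in the workflow itself):

```julia
# Hypothetical sketch: given the changed files and the full list of
# integration suites, run only the touched suites if *all* changes are
# under test/integration/, otherwise run the full matrix.
function select_suites(changed::Vector{String}, all_suites::Vector{String})
    under_integration(f) = startswith(f, "test/integration/")
    # any change outside test/integration/ -> run everything
    all(under_integration, changed) || return all_suites
    # suite name is the directory right below test/integration/
    touched = unique(String(split(f, '/')[3]) for f in changed)
    return filter(in(touched), all_suites)
end

select_suites(["test/integration/SpeedyWeather/runtests.jl"],
              ["SpeedyWeather", "Lux"])                       # -> ["SpeedyWeather"]
select_suites(["src/compiler.jl"], ["SpeedyWeather", "Lux"])  # -> full matrix
```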
BTW, this is missing the addition of SpeedyWeather to https://github.com/EnzymeAD/Enzyme.jl/blob/ab96c82667b3999ac624dfff6eaabdbc376d40cc/.github/workflows/Integration.yml#L47-L56
Oh, yes. I just added it.
Besides the fact that `Test` isn't loaded, the jobs are timing out. Any way to have something lighter to test? All other jobs finish in less than 20 minutes (most of them a lot faster).
Yeah, in our own CI a very similar test takes about 35 minutes. I hoped it would be a bit faster now after the recent Enzyme patches, but that doesn't seem to be the case.
If 20 min is the limit (I didn't know that), then I can turn off some of the parametrizations and see if that's sufficient. I'm out of time to really experiment with that today, though.
The timeout is 45 minutes (https://github.com/EnzymeAD/Enzyme.jl/blob/ab96c82667b3999ac624dfff6eaabdbc376d40cc/.github/workflows/Integration.yml#L32), but all other jobs take less than half of that.
I mean, the gradient compile time is what it is currently 🤷 On my laptop and on my HPC it's a bit faster, but in the GitHub Actions CI it's really quite slow.
I updated the example to mirror more closely the one we already use in our own CI in Speedy. It should be faster now; there it always finishes in < 40 min.
However, it also seems that Enzyme v0.13.105 broke something for us, as we have been getting failures from our CI tests on this since Friday. So this example will currently fail here as well.
Well, that's certainly a reason for getting the integration tests in!
That said, the erroring part is a bit too complex/baked in. Would you be able to construct an MWE of it?
Codecov Report
❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.78%. Comparing base (30e0519) to head (0b438af).
⚠️ Report is 5 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/rules/customrules.jl | 0.00% | 4 Missing ⚠️ |
```
@@           Coverage Diff            @@
##             main    #2784    +/-   ##
========================================
- Coverage   67.79%   67.78%   -0.01%
========================================
  Files          58       58
  Lines       20723    20726       +3
========================================
  Hits        14050    14050
- Misses       6673     6676       +3
```
Okay, I think I fixed the custom rule issue here directly (and there is a potential PR with more information on the 1.11 side: https://github.com/EnzymeAD/Enzyme/pull/2582).
However, on a debug Julia build I pretty much always hit GC errors. @vchuravy I think we really ought to debug and fix that SpeedyWeather GC-error MWE before we can land this.
Thanks for your effort on this already. I had a debugging session with this today, but it's quite nasty to boil down to a proper MWE. I know which intermediate function in our code causes the error, but each individual function it calls is fine. This is the function for our simplest model, and here's an example call with Enzyme. In the next days I'll try to condense it to a proper MWE; it's just not that easy to boil down quickly.
I continue to struggle a bit to come up with an MWE that doesn't use our data structures.
Below is the simplest function in our code that produces the error. Each of the individual functions I can differentiate just fine; it's the interplay between them that seemingly causes the error.
```julia
using Enzyme, SpeedyWeather

function transform_debug!(
    diagn::DiagnosticVariables,
    progn::PrognosticVariables,
    lf::Integer,
    model::Barotropic;
    kwargs...
)
    # retrieve data
    (; vor_grid, u_grid, v_grid) = diagn.grid
    (; scratch_memory) = diagn.dynamics
    vor = get_step(progn.vor, lf)   # relative vorticity at leapfrog step lf
    U = diagn.dynamics.a            # reuse work arrays for velocities in spectral
    V = diagn.dynamics.b            # U = u*coslat, V = v*coslat
    S = model.spectral_transform

    # COMPUTE FUNCTIONS, when at least 3 are active -> GC error
    # that's a spherical harmonics transform (KA kernel + FFT)
    transform!(vor_grid, vor, scratch_memory, S)    # get vorticity on grid from spectral vor

    # that's a spatial derivative computed in spectral space (KA kernel)
    SpeedyWeather.UV_from_vor!(U, V, vor, S)

    # that's a spherical harmonics transform and some rescaling (KA kernel + FFT + KA kernel)
    transform!(u_grid, U, scratch_memory, S, unscale_coslat=true)
    #transform!(v_grid, V, scratch_memory, S, unscale_coslat=true)
    return nothing
end

spectral_grid = SpectralGrid(trunc=9, nlayers=1)
model = BarotropicModel(; spectral_grid)
simulation = initialize!(model)
progn, diagn, model = SpeedyWeather.unpack(simulation)
lf2 = 2
dprogn = make_zero(progn)
ddiagn = make_zero(diagn)

autodiff(Reverse, transform_debug!, Const,
    Duplicated(diagn, ddiagn), Duplicated(progn, dprogn), Const(lf2), Const(model))
```
which causes

```
[29800] signal (11.2): Segmentation fault: 11
in expression starting at REPL[12]:1
_ZN4llvm5Value11setNameImplERKNS_5TwineE at /Users/max/.julia/juliaup/julia-1.10.10+0.aarch64.apple.darwin14/lib/julia/libLLVM.dylib (unknown line)
Allocations: 66874844 (Pool: 66783213; Big: 91631); GC: 92
Segmentation fault: 11
```
and was fine before with Enzyme < 0.13.105.
Is #2818 related to this? Just let me know if I should invest a bit more time in finding a really proper MWE, or if that's already it.
No, that's unrelated; it's an issue from an in-progress PR and doesn't occur on main at the moment.
The whole issue just confuses me.
I found that simply leaving out the kwargs... in the function signature and the @boundscheck in the function body makes it work. The individual functions are all fine to differentiate as well, and the custom rules used for the _fourier! function pass their tests and are fine to differentiate in other contexts.
So in the following script, using some of SpeedyWeather's numerics, I present slightly different versions of the same function; only the last one, which includes an unused kwargs... and a @boundscheck, causes a SegFault.
Julia 1.10, Enzyme v0.13.108, SpeedyWeather main
```julia
using Enzyme, SpeedyWeather

# this works
function transform_debug!(
    vor_grid, u_grid, scratch_memory, U,
    vor,
    S
)
    # catch incorrect sizes early
    @boundscheck SpeedyWeather.ismatching(S, vor_grid) || throw(DimensionMismatch(S, vor_grid))
    @boundscheck SpeedyWeather.ismatching(S, vor) || throw(DimensionMismatch(S, vor))

    g_north = scratch_memory.north
    g_south = scratch_memory.south

    # this is an FFT, for which we have a custom rule defined
    SpeedyWeather.SpeedyTransforms._fourier!(vor_grid, g_north, g_south, S)

    # catch incorrect sizes early
    @boundscheck SpeedyWeather.ismatching(S, u_grid) || throw(DimensionMismatch(S, u_grid))
    @boundscheck SpeedyWeather.ismatching(S, U) || throw(DimensionMismatch(S, U))

    g_north = scratch_memory.north
    g_south = scratch_memory.south
    SpeedyWeather.SpeedyTransforms._fourier!(u_grid, g_north, g_south, S)
    return nothing
end

# this works
function transform_debug_with_kwargs_no_checks!(
    vor_grid, u_grid, scratch_memory, U,
    vor,
    S;
    kwargs...   # now has an unused kwargs...
)
    # but no @boundschecks
    g_north = scratch_memory.north
    g_south = scratch_memory.south

    # this is an FFT, for which we have a custom rule defined
    SpeedyWeather.SpeedyTransforms._fourier!(vor_grid, g_north, g_south, S)

    g_north = scratch_memory.north
    g_south = scratch_memory.south
    SpeedyWeather.SpeedyTransforms._fourier!(u_grid, g_north, g_south, S)
    return nothing
end

# this gives a segfault
function transform_debug_with_kwargs!(
    vor_grid, u_grid, scratch_memory, U,
    vor,
    S;
    kwargs...   # still has an unused kwargs...
)
    # boundschecks and kwargs together cause an error here now
    @boundscheck SpeedyWeather.ismatching(S, vor_grid) || throw(DimensionMismatch(S, vor_grid))
    @boundscheck SpeedyWeather.ismatching(S, vor) || throw(DimensionMismatch(S, vor))

    g_north = scratch_memory.north
    g_south = scratch_memory.south

    # this is an FFT, for which we have a custom rule defined
    SpeedyWeather.SpeedyTransforms._fourier!(vor_grid, g_north, g_south, S)

    @boundscheck SpeedyWeather.ismatching(S, u_grid) || throw(DimensionMismatch(S, u_grid))
    @boundscheck SpeedyWeather.ismatching(S, U) || throw(DimensionMismatch(S, U))

    g_north = scratch_memory.north
    g_south = scratch_memory.south
    SpeedyWeather.SpeedyTransforms._fourier!(u_grid, g_north, g_south, S)
    return nothing
end

spectral_grid = SpectralGrid(trunc=9, nlayers=1)
S = SpectralTransform(spectral_grid)
#simulation = initialize!(model)

# these are RingGrids.Fields
vor_grid = rand(spectral_grid.grid, 1)
u_grid = rand(spectral_grid.grid, 1)
v_grid = rand(spectral_grid.grid, 1)
scratch_memory = deepcopy(S.scratch_memory)

# these are LowerTriangularArrays.LowerTriangularArray
vor = rand(ComplexF32, spectral_grid.spectrum, 1)
U = rand(ComplexF32, spectral_grid.spectrum, 1)

# this works
autodiff(Reverse, transform_debug!, Const,
    Duplicated(vor_grid, make_zero(vor_grid)), Duplicated(u_grid, make_zero(u_grid)),
    Duplicated(scratch_memory, make_zero(scratch_memory)), Duplicated(U, make_zero(U)),
    Duplicated(vor, make_zero(vor)), Const(S))

# this works as well
autodiff(Reverse, transform_debug_with_kwargs_no_checks!, Const,
    Duplicated(vor_grid, make_zero(vor_grid)), Duplicated(u_grid, make_zero(u_grid)),
    Duplicated(scratch_memory, make_zero(scratch_memory)), Duplicated(U, make_zero(U)),
    Duplicated(vor, make_zero(vor)), Const(S))

# this gives a SegFault
autodiff(Reverse, transform_debug_with_kwargs!, Const,
    Duplicated(vor_grid, make_zero(vor_grid)), Duplicated(u_grid, make_zero(u_grid)),
    Duplicated(scratch_memory, make_zero(scratch_memory)), Duplicated(U, make_zero(U)),
    Duplicated(vor, make_zero(vor)), Const(S))
```
I've updated the example above, and I am honestly just really confused by it. This was fine before, but now some interaction between our custom rules for the FFT, a @boundscheck, and an unused kwargs... seemingly causes a SegFault, and only in a configuration like this, with _fourier! called twice. In other contexts all of this is fine.
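For context on why @boundscheck can be sensitive to its surroundings (a generic Julia illustration, not an explanation of the Enzyme bug itself): @boundscheck blocks are elided when the containing function is inlined into an @inbounds context, so the same source can lower to different code depending on the caller, which means any AD transform sees caller-dependent IR around those checks:

```julia
# Generic illustration (no Enzyme involved): the @boundscheck block in
# `checked` runs for a plain call, but is removed when `checked` is
# inlined into an @inbounds block.
const hits = Ref(0)

@inline function checked(x)
    @boundscheck (hits[] += 1)   # side effect so we can observe elision
    return x
end

call_plain(x) = checked(x)               # check runs
call_inbounds(x) = @inbounds checked(x)  # check elided after inlining

call_plain(1)      # increments hits[]
call_inbounds(1)   # does not increment hits[]
```

Whether this mechanism is actually involved in the segfault is pure speculation on my part; it just shows that the presence of @boundscheck genuinely changes the compiled code per call site.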
The custom rule is actually implemented here