Enzyme.jl icon indicating copy to clipboard operation
Enzyme.jl copied to clipboard

CI Integration Example SpeedyWeather.jl

Open maximilian-gelbrecht opened this issue 1 month ago • 21 comments

@vchuravy suggested I add an example using SpeedyWeather.jl to the integration tests.

The test is pretty much the same thing we do in the paper. It's a sensitivity analysis of a single grid point and it checks that this runs without errors and the gradient makes physical sense, so the gradient is localised around the selected grid point and small far away from it.

maximilian-gelbrecht avatar Nov 14 '25 11:11 maximilian-gelbrecht

Your PR requires formatting changes to meet the project's style guidelines. Please consider running Runic (git runic main) to apply these changes.

Click here to view the suggested changes.
diff --git a/test/integration/SpeedyWeather/runtests.jl b/test/integration/SpeedyWeather/runtests.jl
index ea3e354c..c606e9fd 100644
--- a/test/integration/SpeedyWeather/runtests.jl
+++ b/test/integration/SpeedyWeather/runtests.jl
@@ -6,7 +6,7 @@
 using SpeedyWeather, Enzyme, Test
 
 spectral_grid = SpectralGrid(trunc = 32, nlayers = 8)             # define resolution
-model = PrimitiveWetModel(; spectral_grid, physics=false)         # construct model
+model = PrimitiveWetModel(; spectral_grid, physics = false)         # construct model
 # physics = false to accelate the test
 simulation = initialize!(model)
 initialize!(simulation)

github-actions[bot] avatar Nov 14 '25 11:11 github-actions[bot]

Thanks Max!

vchuravy avatar Nov 14 '25 15:11 vchuravy

@giordano can we make the filtering more aggressive and only run CI on the integration test being changed?

vchuravy avatar Nov 14 '25 15:11 vchuravy

We could generate the matrix dynamically, but only if test/integration/ files only have been modified, otherwise I think we want to run all the tests, right?

giordano avatar Nov 14 '25 15:11 giordano

BTW, this is missing adding SpeedyWeather to https://github.com/EnzymeAD/Enzyme.jl/blob/ab96c82667b3999ac624dfff6eaabdbc376d40cc/.github/workflows/Integration.yml#L47-L56

giordano avatar Nov 14 '25 15:11 giordano

Oh, yes. I just added it.

maximilian-gelbrecht avatar Nov 14 '25 15:11 maximilian-gelbrecht

Besides the fact Test isn't loaded, the jobs are timing out. Any way to have something lighter to test? All other jobs finish in less than 20 minutes (most of them are a lot faster)

giordano avatar Nov 14 '25 16:11 giordano

Yeah, in our own CI a very similar test takes about 35 minutes. I hoped it's a bit faster now after recent Enzyme patches, but it doesn't seem to be the case.

If 20 min is the limit (didn't know that), then I can turn off some of the parametrizations and see if that's sufficient. I am out of time to really experiment with that for today though.

maximilian-gelbrecht avatar Nov 14 '25 16:11 maximilian-gelbrecht

Timeout is 45 minutes https://github.com/EnzymeAD/Enzyme.jl/blob/ab96c82667b3999ac624dfff6eaabdbc376d40cc/.github/workflows/Integration.yml#L32 but all other jobs take less than half of that.

giordano avatar Nov 14 '25 16:11 giordano

I mean the gradient compile time is what it is currently 🤷 On my laptop and my HPC it's a bit faster, but in the GitHub Actions CI it's really quite slow.

maximilian-gelbrecht avatar Nov 14 '25 16:11 maximilian-gelbrecht

I updated the example to mirror more closely the one we already use in your CI in Speedy. It should be faster now, and there it always finishes < 40 min.

However, it also seems that Enzymev0.13.105 broke something for us, as we are getting fails from our CI tests on this since Friday. So this example here will also currently fail.

maximilian-gelbrecht avatar Nov 24 '25 10:11 maximilian-gelbrecht

well thats certainly a reason for getting the integration tests in!

wsmoses avatar Nov 25 '25 04:11 wsmoses

that said the erring part is a bit too complex/baked in, would you be able to construct a MWE of?

wsmoses avatar Nov 25 '25 06:11 wsmoses

Codecov Report

:x: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 67.78%. Comparing base (30e0519) to head (0b438af). :warning: Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
src/rules/customrules.jl 0.00% 4 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2784      +/-   ##
==========================================
- Coverage   67.79%   67.78%   -0.01%     
==========================================
  Files          58       58              
  Lines       20723    20726       +3     
==========================================
  Hits        14050    14050              
- Misses       6673     6676       +3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Nov 25 '25 18:11 codecov[bot]

okay I think I fixed the custom rule issue immediately here (and a potential PR for more information on the 1.11: https://github.com/EnzymeAD/Enzyme/pull/2582).

However, on a debug julia I pretty much always hit GC errors. @vchuravy I think we really ought debug and fix that MWE speedweather GC error issue before we can land this

wsmoses avatar Nov 25 '25 20:11 wsmoses

Thanks for your effort on this already. I already had a debugging session today with this, but it's quite nasty to boil this down to a proper MWE. I know which intermediate function from our code causes the error, but each individual function it calls is fine. This is the function for our simplest model, and here's an example call with Enzyme, but in the next days I'll try to condense it to a proper MWE, it's just not that easy to boil this down so quickly.

maximilian-gelbrecht avatar Nov 25 '25 21:11 maximilian-gelbrecht

I continue to struggle a bit to come up with a MWE that doesn't use our data structures.

Below is the simplest function in our code that produces the error. Each of the individual functions I can differentiate just fine, it's the interplay between them that seemingly causes an error here.

using Enzyme, SpeedyWeather

function transform_debug!(
    diagn::DiagnosticVariables,   
    progn::PrognosticVariables,
    lf::Integer,
    model::Barotropic;
    kwargs...
)    
    # retrieve data
    (; vor_grid, u_grid, v_grid ) = diagn.grid
    (; scratch_memory) = diagn.dynamics  

    vor = get_step(progn.vor, lf)   # relative vorticity at leapfrog step lf
    U = diagn.dynamics.a            # reuse work arrays for velocities in spectral
    V = diagn.dynamics.b            # reuse work arrays for velocities in spectral
                                    # U = u*coslat, V=v*coslat
    S = model.spectral_transform

    # COMPUTE FUNCTIONS, WHEN At least 3 are active -> GC error 
    # that's  a spherical harmonics transforms (KA kernel + FFT)
    transform!(vor_grid, vor, scratch_memory, S)    # get vorticity on grid from spectral vor
    
    # that's a spatial derivative computed in spectral space (KA kernel) 
    SpeedyWeather.UV_from_vor!(U, V, vor, S)
    
    # that's a spherical harmonics transforms and some rescaling (KA kernel + FFT + KA kernel)
    transform!(u_grid, U, scratch_memory, S, unscale_coslat=true)
    #transform!(v_grid, V, scratch_memory, S, unscale_coslat=true)

    return nothing
end

spectral_grid = SpectralGrid(trunc=9, nlayers=1)
model = BarotropicModel(; spectral_grid)
simulation = initialize!(model)
progn, diagn, model = SpeedyWeather.unpack(simulation)

lf2 = 2 

dprogn = make_zero(progn)
ddiagn = make_zero(diagn)

autodiff(Reverse, transform_debug!, Const, Duplicated(diagn, ddiagn), Duplicated(progn, dprogn), Const(lf2), Const(model))

which causes

[29800] signal (11.2): Segmentation fault: 11
in expression starting at REPL[12]:1
_ZN4llvm5Value11setNameImplERKNS_5TwineE at /Users/max/.julia/juliaup/julia-1.10.10+0.aarch64.apple.darwin14/lib/julia/libLLVM.dylib (unknown line)
Allocations: 66874844 (Pool: 66783213; Big: 91631); GC: 92
Segmentation fault: 11

and was fine before with Enzyme <0.13.105

maximilian-gelbrecht avatar Nov 26 '25 16:11 maximilian-gelbrecht

Is #2818 related to this here? Just let me know if I should invest a bit more time for finding a really proper MWE, or if that's already it.

maximilian-gelbrecht avatar Nov 29 '25 14:11 maximilian-gelbrecht

No that's unrelated and an issue from an in progress PR which doesn't occur on main atm

wsmoses avatar Nov 29 '25 16:11 wsmoses

The whole issue just confuses me.

Now, I found that just leaving away the kwargs... in the function signature and the @boundscheck in the function makes it work. The individual functions are all fine to differentiate as well, and the custom rules that are used for the _fourier! function also pass their tests and are fine to differentiate in other contexts.

So in the following script, using some of SpeedyWeather's numerics I just present the just slightly different versions where the last one that includes an unused kwargs... and a @boundscheck causes a SegFault.

Julia 1.10, Enzyme v0.13.108, SpeedyWeather main

using Enzyme, SpeedyWeather
 
# this works 
function transform_debug!(
    vor_grid, u_grid, scratch_memory, U, 
    vor, 
    S
)    
    # catch incorrect sizes early
    @boundscheck SpeedyWeather.ismatching(S, vor_grid) || throw(DimensionMismatch(S, vor_grid))
    @boundscheck SpeedyWeather.ismatching(S, vor) || throw(DimensionMismatch(S, vor))

    g_north = scratch_memory.north 
    g_south = scratch_memory.south 

    # this is a FFT, for which we have a custom rule defined 
    SpeedyWeather.SpeedyTransforms._fourier!(vor_grid, g_north, g_south, S)

    # catch incorrect sizes early
    @boundscheck SpeedyWeather.ismatching(S, u_grid) || throw(DimensionMismatch(S, u_grid))
    @boundscheck SpeedyWeather.ismatching(S, U) || throw(DimensionMismatch(S, U))

    g_north = scratch_memory.north 
    g_south = scratch_memory.south 

    SpeedyWeather.SpeedyTransforms._fourier!(u_grid, g_north, g_south, S)

    return nothing
    
end

# this works 
function transform_debug_with_kwargs_no_checks!(
    vor_grid, u_grid, scratch_memory, U, 
    vor, 
    S; 
    kwargs... # now has an unused kwargs...
)   
    # but no @boundschecks 
    g_north = scratch_memory.north 
    g_south = scratch_memory.south 

    # this is a FFT, for which we have a custom rule defined 
    SpeedyWeather.SpeedyTransforms._fourier!(vor_grid, g_north, g_south, S)

    g_north = scratch_memory.north 
    g_south = scratch_memory.south 

    SpeedyWeather.SpeedyTransforms._fourier!(u_grid, g_north, g_south, S)

    return nothing
end

# this gives a segfault 
function transform_debug_with_kwargs!(
    vor_grid, u_grid, scratch_memory, U, 
    vor, 
    S; 
    kwargs... # still has an unused kwargs...
)   
    # boundschecks and kwarg together cause an error here now 
    @boundscheck SpeedyWeather.ismatching(S, vor_grid) || throw(DimensionMismatch(S, vor_grid))
    @boundscheck SpeedyWeather.ismatching(S, vor) || throw(DimensionMismatch(S, vor))

    # that's  a spherical harmonics transforms (KA kernel + FFT)
    g_north = scratch_memory.north 
    g_south = scratch_memory.south 

    # this is a FFT for which we have a custom rule defined
    SpeedyWeather.SpeedyTransforms._fourier!(vor_grid, g_north, g_south, S)

    @boundscheck SpeedyWeather.ismatching(S, u_grid) || throw(DimensionMismatch(S, u_grid))
    @boundscheck SpeedyWeather.ismatching(S, U) || throw(DimensionMismatch(S, U))

    g_north = scratch_memory.north 
    g_south = scratch_memory.south 

    SpeedyWeather.SpeedyTransforms._fourier!(u_grid, g_north, g_south, S)

    return nothing
end

spectral_grid = SpectralGrid(trunc=9, nlayers=1)
S = SpectralTransform(spectral_grid)
#simulation = initialize!(model)

# these are RingGrids.Fields 
vor_grid = rand(spectral_grid.grid, 1)
u_grid = rand(spectral_grid.grid, 1)
v_grid = rand(spectral_grid.grid, 1)
scratch_memory = deepcopy(S.scratch_memory)

# these are LowerTriangularArrays.LowerTriangularArray
vor = rand(ComplexF32, spectral_grid.spectrum, 1)  
U =  rand(ComplexF32, spectral_grid.spectrum, 1)  

# this works 
autodiff(Reverse, transform_debug!, Const, Duplicated(vor_grid, make_zero(vor_grid)), Duplicated(u_grid, make_zero(u_grid)), Duplicated(scratch_memory, make_zero(scratch_memory)), Duplicated(U, make_zero(U)), Duplicated(vor, make_zero(vor)), Const(S))

# this works as well 
autodiff(Reverse, transform_debug_with_kwargs_no_checks!, Const, Duplicated(vor_grid, make_zero(vor_grid)), Duplicated(u_grid, make_zero(u_grid)), Duplicated(scratch_memory, make_zero(scratch_memory)), Duplicated(U, make_zero(U)), Duplicated(vor, make_zero(vor)), Const(S))

# this gives a SegFault
autodiff(Reverse, transform_debug_with_kwargs!, Const, Duplicated(vor_grid, make_zero(vor_grid)), Duplicated(u_grid, make_zero(u_grid)), Duplicated(scratch_memory, make_zero(scratch_memory)), Duplicated(U, make_zero(U)), Duplicated(vor, make_zero(vor)), Const(S))

maximilian-gelbrecht avatar Dec 05 '25 15:12 maximilian-gelbrecht

I've updated the example above and I am honestly just really confused by it. This was fine before, but now seemingly some interaction between our custom rules for the FFT, a @boundscheck and an unused kwargs... causes a SegFault, but also just in a configuration like that, with calling _fourier! twice. In other contexts all of this is fine.

The custom rule is actually implemented here

maximilian-gelbrecht avatar Dec 10 '25 11:12 maximilian-gelbrecht