Oceananigans.jl
Reset model clock and skip `time_step!`ing if next actuation time is tiny
Closes https://github.com/CliMA/Oceananigans.jl/issues/3593
This is somewhat of a hack to investigate (and eventually solve) https://github.com/CliMA/Oceananigans.jl/issues/3593. For now I just want to check whether this breaks too much stuff in the code, but it does seem to solve the problem. Here's an MWE to demonstrate:
using Oceananigans
grid_base = RectilinearGrid(topology = (Bounded, Periodic, Bounded),
                            size = (16, 20, 4), extent = (800, 1000, 100))
@inline east_wall(x, y, z) = x > 400
grid = ImmersedBoundaryGrid(grid_base, GridFittedBoundary(east_wall))
model = NonhydrostaticModel(grid = grid, timestepper = :RungeKutta3, buoyancy = BuoyancyTracer(), tracers = :b,)
N² = 6e-6
b∞(x, y, z) = N² * z
set!(model, b=b∞)
simulation = Simulation(model, Δt=25, stop_time=1e4,)
using Statistics: std
using Printf
progress_message(sim) = @printf("Iteration: %04d, time: %s, iteration×Δt: %s, std(pNHS) = %.2e\n",
                                iteration(sim), sim.model.clock.time, iteration(sim) * sim.Δt, std(model.pressures.pNHS))
add_callback!(simulation, progress_message, IterationInterval(1))
simulation.output_writers[:snaps] = NetCDFOutputWriter(model, (; model.pressures.pNHS,),
filename = "test_pressure.nc",
schedule = TimeInterval(100),
overwrite_existing = true,)
run!(simulation)
On main this produces stuff like:
Iteration: 0001, time: 25.0, iteration×Δt: 25.0, std(pNHS) = 6.02e-03
Iteration: 0002, time: 50.0, iteration×Δt: 50.0, std(pNHS) = 6.02e-03
Iteration: 0003, time: 75.0, iteration×Δt: 75.0, std(pNHS) = 6.02e-03
Iteration: 0004, time: 99.99999999999999, iteration×Δt: 100.0, std(pNHS) = 6.02e-03
Iteration: 0005, time: 100.0, iteration×Δt: 125.0, std(pNHS) = 2.72e+10
The last two lines are the notable ones: we went from time: 99.99999999999999 to time: 100.0, implying a very tiny time step, which results in a spurious pressure field, as quantified by the last output of the last line: std(pNHS) = 2.72e+10. Note that, because of this, time and iteration×Δt no longer match in the last line (time: 100.0 versus iteration×Δt: 125.0). This "misstep" happens many times throughout the run on main.
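To see where the misstep comes from: clock.time is built by repeated floating-point additions (including fractional Runge–Kutta substeps), so it drifts off the exact n × Δt grid, and a TimeInterval schedule that clips the step to land on the next actuation time is then left with a near-zero catch-up step. Here is a self-contained Python illustration of the mechanism (the same arithmetic applies to Julia's Float64; next_actuation is just an illustrative stand-in for the schedule's target time):

```python
# Accumulate time the way a model clock does: by repeated addition.
dt = 0.1
t = 0.0
for _ in range(10):
    t += dt

print(t)             # 0.9999999999999999 -- not exactly 1.0
print(t == 10 * dt)  # False: the accumulated clock drifted off the grid

# A schedule that clips the time step to land exactly on the next
# actuation time is then left with a round-off-sized "catch-up" step:
next_actuation = 1.0
tiny_step = next_actuation - t
print(tiny_step)     # ~1.1e-16, the kind of Δt that breaks the pressure solve
```

In the MWE above the drift presumably comes from the RK3 substeps (25 and its multiples are exactly representable), but the effect is the same: the clock lands just short of 100.0 and the next step is tiny.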
On this branch this doesn't happen anymore, and even after many time-steps things remain aligned (albeit with a very small round-off error):
Iteration: 0396, time: 9900.0, iteration×Δt: 9900.0, std(pNHS) = 5.99e-03
Iteration: 0397, time: 9925.000000000002, iteration×Δt: 9925.0, std(pNHS) = 5.99e-03
Iteration: 0398, time: 9950.000000000004, iteration×Δt: 9950.0, std(pNHS) = 5.99e-03
Iteration: 0399, time: 9975.000000000005, iteration×Δt: 9975.0, std(pNHS) = 5.99e-03
Ideally the way to really fix this would be to figure out a way to avoid round-off errors, but I haven't been able to do that yet.
Nice work! I'm curious about the criterion. Should it be something like
dt = 10 * eps(dt) * sim.dt
? Or does it have to be larger than that (hence the factor 1e10)?
It'd be nice not to have to define next_actuation_time for every schedule... it doesn't really make sense for WallTimeInterval either. Plus, we want users to be able to provide custom schedules (since they only need to be a function of model that returns true/false) so that people can trigger output / action using interesting custom criteria...
Nice work! I'm curious about the criterion. Should it be something like dt = 10 * eps(dt) * sim.dt? Or does it have to be larger than that (hence the factor 1e10)?
I actually don't know what the proper criterion should be. With the one you proposed, the error doesn't go away in this example, since the tiny time step is about 1e-12 but 10 * eps(dt) * sim.dt comes out to about 1e-13. If we use 100 * eps(dt) * sim.dt then it works. But I don't yet know how much of this will generalize to other, more complex simulations. I still have to test these on my own simulations to see what works.
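For concreteness, here is what those candidate thresholds evaluate to for the Δt = 25 used in the MWE, written in Python, where math.ulp plays the role of Julia's eps(x):

```python
import math

dt = 25.0

print(dt / 1e10)                 # 2.5e-09  (the criterion in this PR)
print(10 * math.ulp(dt) * dt)    # ~8.9e-13 (the proposed criterion)
print(100 * math.ulp(dt) * dt)   # ~8.9e-12
```

So the PR's threshold is roughly three orders of magnitude larger than the 10 * eps(dt) * dt one, which is consistent with the former catching tiny steps that the latter misses.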
It'd be nice not to have to define next_actuation_time for every schedule... it doesn't really make sense for WallTimeInterval either. Plus, we want users to be able to provide custom schedules (since they only need to be a function of model that returns true/false) so that people can trigger output / action using interesting custom criteria...
Yeah, agree. I'm not sure of a good workaround here though. Do you have suggestions?
For the time being we can just set a fallback method as next_actuation_time(schedule) = Inf, I guess? (Similar to what I did for IterationInterval.)
Also, nice to see that tests pass and nothing is breaking :)
I did a few tests with some criteria for timestep-skipping with a couple of my own simulations in addition to the MWE included here. In summary:
- Criterion sim.Δt / 1e10: successfully gets rid of the problem in both the MWE and in my simulations
- Criterion 10 * eps(sim.Δt) * sim.Δt: doesn't get rid of the problem in any simulation
- Criterion 100 * eps(sim.Δt) * sim.Δt: fixes the problem in the MWE but not in my simulations, although it does decrease its frequency of occurrence a good amount.
- Criterion 1000 * eps(sim.Δt) * sim.Δt: fixes everything in all simulations I've tried.
So only options 1 and 4 fully fix the problem (at least in the simulations I've tried so far). Both of those options rely on pretty arbitrary numbers though, so I'm not happy with either. From the point of view of seeing the timestep-skipping as an approximation ($u^{n+1} \approx u^n$), criterion 1 maybe makes more sense, although I'm not sure how it'd behave for Float32 simulations.
I see three possible ways to go about it right now:
- Do what this PR is doing, and manually set the criterion to either option 1 or 4 above. If it turns out that some simulations still have issues, we revisit.
- Add min_Δt as a property of NonhydrostaticModel (or maybe Simulation?). I think the minimum Δt for which time skipping will be necessary will vary significantly between simulations, so this solution deals with that by leaving the decision up to the user if they are interested in the pressure output.
- Try something that actually prevents these round-off errors instead of dealing with them. @glwagner suggested an Integer-based model clock, but there might be other options.
Note that eps(sim.Δt) is similar to sim.Δt * eps(typeof(Δt)). So Δt / 1e10 is pretty similar to 100000 * eps(sim.Δt). The only point of using eps is to avoid hard coding Float64.
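The identity being used here can be checked numerically; in Python terms, math.ulp corresponds to Julia's eps(x) and sys.float_info.epsilon to eps(Float64):

```python
import math
import sys

dt = 25.0
absolute = math.ulp(dt)                 # spacing of Float64 values near 25.0
relative = dt * sys.float_info.epsilon  # dt * eps(Float64)

print(absolute, relative)          # agree to within a factor of 2
print(dt / 1e10, 1e5 * absolute)   # 2.5e-09 vs ~3.6e-10: within an order of magnitude
```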
@glwagner are you okay if I just add min_Δt as a property of NonhydrostaticModel and maintain the strategy of skipping the time step if Δt is smaller than that? I think that's a reasonable and simple way to fix this.
This isn't a problem of the nonhydrostatic model specifically. Why would we add it as a property there?
This isn't a problem of the nonhydrostatic model specifically. Why would we add it as a property there?
This PR is supposed to fix the output of the pressure, which can be unreasonably large if the time step is really small due to, I believe, the nonhydrostatic pressure correction, which is specific to the NonhydrostaticModel. Unless I'm missing something.
This wouldn't necessarily fix https://github.com/CliMA/Oceananigans.jl/issues/3056, which may be what you're thinking of. I can also add min_Δt to Simulation, so that this can be used with other models.
I think we should fix the problem once. Otherwise we'll end up with unnecessary code somewhere that has to be deleted.
I think we should fix the problem once. Otherwise we'll end up with unnecessary code somewhere that has to be deleted.
@glwagner Can you please be clearer? Does that mean adding min_Δt to Simulation is an acceptable solution? Or should we try to avoid these round-off errors to even happen in the first place?
I think the solution discussed here, where the time-step change associated with TimeInterval schedules is restricted by a sort of relative tolerance criterion, is acceptable if we can't tease out the underlying issue (or it's unsolvable).
If we could indeed solve the problem simply by eliminating round-off error, then this would almost certainly be preferred since it might be much simpler (e.g., just fixing a floating-point-unstable arithmetic operation by rearranging terms). That could be really easy.
@Sbozzolo might be able to help because I believe they do something special to avoid round off issues in ClimaAtmos.
I would hesitate to establish an absolute min_Δt that's independent of the units being used, unless the default is 0.
I'll try to get going with the MWE here. Is the immersed boundary actually essential to the MWE (i.e., does the error not occur without it)? And buoyancy?
It seems that without the immersed boundary, there is still a problem of super small time-steps, but the pressure does not blow up.
Ok getting closer maybe...
I think this problem is generic and cannot be solved in general for arbitrary time steps. Here's a few thoughts:
- Reading about Kahan summation makes it clear that we simply cannot avoid errors if we would like to add a small floating point number (the time step) to a very large number (the model time).
- I think the issue with the time-step is whether or not we can compute the RHS of the pressure Poisson equation accurately --- which is div(u') / Δt, where u' = u + Δt * Gu is the predictor velocity and div is the divergence. This is interesting, because I could not figure out why we would ever find large div(u') with small Δt, even in this MWE. But now I realize that because of the status of the immersed Poisson solver, the velocity along the boundary is divergent, strongly so. So div(u') is large along the boundary, and when we divide by Δt we get something huge. The magnitude of div(u') also somehow seems to depend on the time step (as does the magnitude of the spurious circulation). The correct solution to this case remains at rest, of course. (An aside: this problem could be avoided by separately computing the hydrostatic pressure, and then using special horizontal gradient operators that avoid computing a hydrostatic pressure gradient across an immersed boundary. However, this would only be correct for no-flux boundary conditions on buoyancy on side walls.) Anyways, apparently because of this issue with the immersed pressure solver, it seems that div(u') is large (because div(u) is large) even when Δt = O(1e-14)...
- As a result of all of this, I am confused about whether this MWE is actually reliable for debugging the issue. I guess we should expect to see problems simply when Δt = O(eps), because this is when div(u') / Δt cannot be reliably computed, I think. This leads to a fairly simple criterion for the time step that's compatible with the pressure correction. But as noted in this PR, this is not enough to fix issues with the immersed boundary MWE... whether or not that is because of problems with the setup itself, I'm not sure...
- All of that said, taking @tomchor's suggestion to be more careful in updating the clock for RK3 actually does solve the MWE here. Obviously, this is again addressing the (in principle not entirely solvable) issue of error accumulation in clock.time, rather than addressing the other issue of very small time-steps producing an ill-posed pressure correction. I think we should fix RK3 separately: since we cannot completely avoid accumulating error in clock.time, every little thing we do to make it more accurate is a good idea.
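One standard tool for the error-accumulation half of the problem is the compensated (Kahan) summation mentioned above: it carries along the low-order bits that plain += throws away. Here is a Python sketch of the idea (a hypothetical CompensatedClock for illustration, not Oceananigans code):

```python
class CompensatedClock:
    """Accumulate time with Kahan (compensated) summation to reduce
    round-off drift from many small additions."""
    def __init__(self):
        self.time = 0.0
        self._c = 0.0  # running compensation for lost low-order bits

    def advance(self, dt):
        y = dt - self._c
        t = self.time + y
        self._c = (t - self.time) - y  # recover what the addition dropped
        self.time = t

naive = 0.0
clock = CompensatedClock()
for _ in range(10_000):
    naive += 0.1
    clock.advance(0.1)

print(naive)       # drifts visibly away from 1000.0
print(clock.time)  # stays at 1000.0 to within a single rounding
```

This only slows the drift, of course; it doesn't remove the separate small-Δt pressure-correction issue.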
PR #3617 helps with this MWE
@Sbozzolo might be able to help because I believe they do something special to avoid round off issues in ClimaAtmos.
With regards to floating point instabilities due to arithmetic with time and time intervals, ultimately, we will be solving this issue (and others) by moving away from a floating point time in favor of an integer one (e.g., milliseconds from a start date). As a fun fact, if you are using Float32 and keep track of your time in seconds, t + dt is going to have numerical error after approximately one year of simulated time (which is not that much).
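That Float32 fun fact is easy to verify; here is a Python check that rounds through IEEE 754 single precision with struct (standing in for Julia's Float32):

```python
import struct

def f32(x):
    """Round a Python float through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

year = 365 * 24 * 3600.0  # 31,536,000 seconds
t = f32(year)

# The spacing between adjacent Float32 values near one year is 2 seconds,
# so a 1-second time step is lost entirely:
print(f32(t + 1.0) == t)  # True
print(f32(t + 4.0) == t)  # False: steps of a few seconds still register
```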
Looks good, maybe bump the minor version
Looks good, maybe bump the minor version
Done! I wanna test this code in my actual simulation before sending it off for reviews though.
Also, I'm thinking of using the MWE at the top comment as a test. It's the most reliable way I could reproduce this issue. Thoughts?
I would just put in a very simple test that time-stepping is skipped correctly, by manually taking two time-steps and changing dt in between.
I think the kind of test you are describing may not be appropriate for CI; it's more a validation test. I think if you can put a unit test to show the feature works correctly, you can later show that using the feature solves pressure solver issues and be happy that the unit test ensures it will continue to work as in the validation test.
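A sketch of what such a unit test could look like, using a hypothetical should_skip_step helper that mirrors the minimum-relative-step criterion discussed in this PR (the names and threshold here are assumptions for illustration, not the actual Oceananigans API):

```python
def should_skip_step(dt, reference_dt, minimum_relative_step=1e-10):
    """Skip time-stepping when dt is a negligible fraction of reference_dt."""
    return dt < minimum_relative_step * reference_dt

# "Manually taking two time-steps and changing dt in between":
assert not should_skip_step(25.0, 25.0)   # a normal step is taken
assert should_skip_step(1e-12, 25.0)      # a round-off-sized step is skipped
assert not should_skip_step(1e-12, 25.0, minimum_relative_step=1e-30)
```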
I tested this branch on my simulation and things work fine. However, when trying to include a test I came across some weird behavior. Namely the snippet below fails:
using Oceananigans
using Test
grid = RectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model, Δt=3, stop_iteration=1)
run!(simulation)
stop_time = 1.0
simulation = Simulation(model, Δt=1; stop_time)
run!(simulation)
@test simulation.model.clock.time == stop_time
with
Test Failed at REPL[11]:1
Expression: simulation.model.clock.time == stop_time
Evaluated: 4.0 == 1.0
If I re-build model before re-creating a simulation, then it works:
using Oceananigans
using Test
grid = RectilinearGrid(size=(1, 1, 1), extent=(1, 1, 1))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model, Δt=3, stop_iteration=1)
run!(simulation)
stop_time = 1.0
model = NonhydrostaticModel(; grid)
simulation = Simulation(model, Δt=1; stop_time)
run!(simulation)
@test simulation.model.clock.time == stop_time
So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.
So there seems to be some attribute of model (presumably model.clock?) that's leading to different Simulation behavior. Is this expected? I couldn't quite figure out what was happening.
Indeed, the clock is not reset to zero when the simulation is built, so you are restarting from model.clock == 3.0.
Ah, true! I'm ashamed I forgot about that. Although, in this case shouldn't the simulation just not iterate and keep the clock at 3?
@wenegrat @zhazorken This should fix the pressure issues when writing to disk. All you need to do is add the keyword argument minimum_relative_step=1e-10 (or some other very small positive number) when constructing a Simulation. Let me know if you have any issues.