
`PerturbationAdvectionOpenBoundaryCondition` + `ConjugateGradientSolver` causes simulation to blow-up with versions 0.96.20+

tomchor opened this issue 7 months ago • 19 comments

Since 0.96.20, I've seen some scripts that previously ran fine produce NaNs when using PerturbationAdvectionOpenBoundaryCondition with ConjugateGradientPoissonSolver on an ImmersedBoundaryGrid, even with extremely simple bathymetries. For example, the simulation below uses a domain with a fake flat bottom. It runs fine on 0.96.19, but it blows up on 0.96.20:

using Oceananigans
using Oceananigans.BoundaryConditions: PerturbationAdvectionOpenBoundaryCondition
using Oceananigans.Solvers: ConjugateGradientPoissonSolver

Lx, Lz = 10, 3
Nx = Nz = 8

grid_base = RectilinearGrid(topology = (Bounded, Flat, Bounded), size = (Nx, Nz), x = (0, Lx), z = (0, Lz))
flat_bottom(x) = 1
grid = ImmersedBoundaryGrid(grid_base, PartialCellBottom(flat_bottom))
U = 1
inflow_timescale = 1e-4
outflow_timescale = Inf
u_boundaries = FieldBoundaryConditions(
    west   = PerturbationAdvectionOpenBoundaryCondition(U; inflow_timescale, outflow_timescale),
    east   = PerturbationAdvectionOpenBoundaryCondition(U; inflow_timescale, outflow_timescale),)

model = NonhydrostaticModel(; grid, boundary_conditions = (; u = u_boundaries),
                              pressure_solver = ConjugateGradientPoissonSolver(grid, maxiter=5),
                              advection = WENO(; grid, order=5))
set!(model, u=U)

Δt = 0.5 * minimum_zspacing(grid) / abs(U)
simulation = Simulation(model; Δt = Δt, stop_time=10)

conjure_time_step_wizard!(simulation, IterationInterval(1), cfl = 0.1)

run!(simulation)

For the particular example above there are some tweaks I can make to get it to run, but those aren't an option for my production scripts.

I'm assuming this is due to https://github.com/CliMA/Oceananigans.jl/pull/4041, but I can't figure out how to fix this behavior.

cc @amontoison @glwagner

tomchor avatar Apr 29 '25 14:04 tomchor

@tomchor The issue is the preconditioner used in ConjugateGradientPoissonSolver. I changed them a little bit to be SPD (mandatory for Krylov methods) in #4041.

Maybe the modifications to make it SPD are not relevant for some problems and should be made based on the physics of the problem.
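
For context, here is a tiny, Oceananigans-agnostic sketch of the property in question: CG assumes both the operator and the preconditioner act like SPD matrices, which is easy to check on a small dense example.

using LinearAlgebra

# Toy 1D Poisson operator with Dirichlet boundaries (SPD).
A = SymTridiagonal(fill(2.0, 5), fill(-1.0, 4))

# A candidate preconditioner M must also be symmetric positive definite for CG.
M = Diagonal(diag(A))                         # Jacobi preconditioner, trivially SPD here

issymmetric(Matrix(M)) && isposdef(Matrix(M)) # true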

amontoison avatar Apr 29 '25 15:04 amontoison

I feel like this hints at some deeper underlying issue. Is that possible?

glwagner avatar Apr 29 '25 18:04 glwagner

Yes, maybe we could keep the offset with the mean if it doesn't make the preconditioner indefinite?

We need to better understand the origin of the linear system and the underlying properties of the linear operators.

amontoison avatar Apr 29 '25 19:04 amontoison

@tomchor The issue is the preconditioner used in ConjugateGradientPoissonSolver. I changed them a little bit to be SPD (mandatory for Krylov methods) in #4041.

Maybe the modifications to make it SPD are not relevant for some problems and should be made based on the physics of the problem.

I see. I don't understand much about that, but it's worrisome to me that this extremely simple example (simply a flat bottom) blows up. In fact, I haven't been able to run a single one of my scripts with open boundaries and the new solver after the change. While making the change to Krylov, did you run any simulations with open boundaries that I can take a look at? Perhaps there's something to be learned from the scripts that did work for you.

tomchor avatar Apr 29 '25 19:04 tomchor

We added a new solver based on Krylov.jl but didn't change the ConjugateGradientPoissonSolver. The only thing I modified that could impact this solver is the preconditioner.

amontoison avatar Apr 29 '25 20:04 amontoison

pressure_solver = ConjugateGradientPoissonSolver(grid, maxiter=5)

I think maxiter=5 is rather optimistic. I would be happy if an iterative solver needed only 10-15 iterations. (This is probably not the issue; still, 5 iterations is rather low for a CG method, even with a good preconditioner.)
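
A quick thing to try in the MWE above is simply giving CG more room (whether this alone avoids the blow-up is untested here):

# Same MWE, but with a less optimistic iteration cap.
pressure_solver = ConjugateGradientPoissonSolver(grid, maxiter=100)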

simone-silvestri avatar Apr 29 '25 22:04 simone-silvestri

We've seen convergence in as few as 3 iterations with the FFT-based preconditioner.

glwagner avatar Apr 29 '25 23:04 glwagner

We added a new solver based on Krylov.jl but didn't change the ConjugateGradientPoissonSolver. The only thing I modified that could impact this solver is the preconditioner.

Thanks for the clarification. Let me rephrase my question. While making the change to the new preconditioner, did you run any simulations with open boundaries that I can take a look at? Are there changes to the preconditioner that we can try? (I'm assuming we can't go back to the previous preconditioner, correct?)

tomchor avatar Apr 30 '25 13:04 tomchor

Maybe you can try with no preconditioner? It might be slower but it should not crash.
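
If ConjugateGradientPoissonSolver accepts a preconditioner keyword that takes nothing (I believe it does, but please check the current signature before relying on this), the sketch would be:

# Hypothetical: assumes a `preconditioner` keyword that accepts `nothing`;
# verify against the ConjugateGradientPoissonSolver docstring/source.
pressure_solver = ConjugateGradientPoissonSolver(grid, maxiter=100, preconditioner=nothing)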

simone-silvestri avatar Apr 30 '25 14:04 simone-silvestri

We added a new solver based on Krylov.jl but didn't change the ConjugateGradientPoissonSolver. The only thing I modified that could impact this solver is the preconditioner.

Thanks for the clarification. Let me rephrase my question. While making the change to the new preconditioner, did you run any simulations with open boundaries that I can take a look at? Are there changes to the preconditioner that we can try? (I'm assuming we can't go back to the previous preconditioner, correct?)

Why are you assuming that?

We can make whatever changes are needed to get this solver working. Or, more strongly, we must make whatever changes are needed to get this solver working.

@tomchor, the problem is that your case is not tested in CI. Please open a PR to add this test. That test will ensure that the preconditioner continues to work for simple situations like the one you are describing. Even better, with a flat bottom we have an exact solver (by changing the domain size). A good test would compare the PCG in an expanded domain + immersed boundary against the exact solver. I also think we can add a test in a more complicated case which shows that the default setting produces convergence.
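
As a rough sketch of the simplest such regression test (reusing the MWE verbatim; the expanded-domain comparison against the exact solver would build on this, and the threshold below is arbitrary):

using Test

# Build `model` and `simulation` exactly as in the MWE above, then:
run!(simulation)

u = interior(model.velocities.u)
@test !any(isnan, u)           # simulation did not blow up
@test maximum(abs, u) < 10U    # velocities stay O(U)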

I would add, by the way, that we can definitely go back to a previous algorithm, but the real and important question is why the preconditioner is not converging. We are limited in the actions we can take without understanding the problem.

glwagner avatar Apr 30 '25 14:04 glwagner

We added a new solver based on Krylov.jl but didn't change the ConjugateGradientPoissonSolver. The only thing I modified that could impact this solver is the preconditioner.

Thanks for the clarification. Let me rephrase my question. While making the change to the new preconditioner, did you run any simulations with open boundaries that I can take a look at? Are there changes to the preconditioner that we can try? (I'm assuming we can't go back to the previous preconditioner, correct?)

Why are you assuming that?

Because according to @amontoison, he "changed them a little bit to be SPD (mandatory for Krylov methods) in #4041". That's in https://github.com/CliMA/Oceananigans.jl/issues/4451#issuecomment-2839268397

@tomchor, the problem is that your case is not tested in CI. Please open a PR to add this test. That test will ensure that the preconditioner continues to work for simple situations like the one you are describing. Even better, with a flat bottom we have an exact solver (by changing the domain size). A good test would compare the PCG in an expanded domain + immersed boundary against the exact solver. I also think we can add a test in a more complicated case which shows that the default setting produces convergence.

I haven't opened a PR because I have no idea how to solve this issue. I figured once we know how to solve it, we can open a PR and I'll add tests to make sure it doesn't break again.

tomchor avatar Apr 30 '25 14:04 tomchor

Because according to @amontoison, he "changed them a little bit to be SPD (mandatory for Krylov methods) in https://github.com/CliMA/Oceananigans.jl/pull/4041". That's in https://github.com/CliMA/Oceananigans.jl/issues/4451#issuecomment-2839268397

But Krylov can also be changed to fix this. Everything is always on the table.

glwagner avatar Apr 30 '25 15:04 glwagner

The PR we need will add the test.

glwagner avatar Apr 30 '25 15:04 glwagner

So, I'm trying to look into this, and I think the initial condition may be working against a solution. In the example above (a flat bottom in the lower 1/3 of the domain), after issuing set!(model, u=1), this is what I get back (without incompressibility being enforced):

9×8 view(::Array{Float64, 3}, 5:13, 1, 5:12) with eltype Float64:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
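
(For reference, a view like the one above can be reproduced with something along these lines, assuming the halo here is 4 so that parent indices 5:13 and 5:12 correspond to i = 1:Nx+1 and k = 1:Nz.)

# Raw parent data, bypassing Field indexing and any extra masking; indices assume a halo of 4.
view(parent(model.velocities.u), 5:13, 1, 5:12)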

Plotting the data (not the Field, in order to bypass extra masking) visually, it looks like this:

Image

This is not what I expect set!() to do here. I expected u to be 1 everywhere except in the bottom, where it would be masked to zero in the data. Instead the west boundary is all zeros and the east boundary is all ones. I traced this behavior to these methods https://github.com/CliMA/Oceananigans.jl/blob/60ab962db62450dcc79ffc3262c6b23398f4a86d/src/ImmersedBoundaries/mask_immersed_field.jl#L53-L64

and it seems like _mask_immersed_field!() isn't being dispatched up to grid.Nx+1 in this case; it stops at grid.Nx. Is this expected behavior?

In summary, I suspect this is making it harder for the pressure solver to converge to a solution. And indeed, if I enforce incompressibility and use the FFT solver, I get a reasonable solution, but one that ignores the immersed boundary. If I enforce incompressibility with the CG solver, I get velocities on the order of $10^{11}$.
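
For what it's worth, one way to quantify "enforcing incompressibility" is to compute the divergence directly. A sketch, assuming the usual KernelFunctionOperation pattern with divᶜᶜᶜ from Oceananigans.Operators:

using Oceananigans.Operators: divᶜᶜᶜ

# Velocity divergence diagnostic; should be near machine precision after a successful pressure correction.
u, v, w = model.velocities
δ = KernelFunctionOperation{Center, Center, Center}(divᶜᶜᶜ, grid, u, v, w)
max_div = maximum(abs, interior(compute!(Field(δ))))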

tomchor avatar May 21 '25 17:05 tomchor

Try changing :xyz to size(field)

glwagner avatar May 21 '25 18:05 glwagner

That does indeed dispatch the kernel up to grid.Nx+1. It turns out that doesn't change the masking, though. Part of the issue is that we're using immersed_peripheral_node in _mask_immersed_field!, which (in this particular case) masks fields like this (yellow is a true value, purple is a false value):

Image

thus excluding the edges. Maybe we should be using immersed_inactive_node, which masks fields like

Image

which seems much more correct at least in this example.
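
To see the difference concretely, something like the sketch below could rebuild the two masks at u locations. I'm assuming the predicate signature is (i, j, k, grid, ℓx, ℓy, ℓz) with location instances, so double-check against the source:

using Oceananigans.ImmersedBoundaries: immersed_peripheral_node, immersed_inactive_node

# Hypothetical mask comparison at u points (Face, Center, Center); signature assumed, not verified.
ℓx, ℓy, ℓz = Face(), Center(), Center()
peripheral_mask = [immersed_peripheral_node(i, 1, k, grid, ℓx, ℓy, ℓz) for i in 1:grid.Nx+1, k in 1:grid.Nz]
inactive_mask   = [immersed_inactive_node(i, 1, k, grid, ℓx, ℓy, ℓz)   for i in 1:grid.Nx+1, k in 1:grid.Nz]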

Upon further verification, it turns out that this doesn't actually lead to much better results using the CG solver (although there seems to be some improvement), but I do think it's a better way to do things regardless. I can create a different issue to investigate this if you guys prefer.

One other issue, which I mentioned in https://github.com/CliMA/Oceananigans.jl/issues/4451#issuecomment-2898725154 and which is worth investigating, is that velocities at i=1 are zero while velocities at i=Nx+1 are one. I'm not sure why that is, but I'll investigate tomorrow. That can't be helping the solver...
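
(That asymmetry can be seen directly from the MWE:)

model.velocities.u[1, 1, 1:grid.Nz]          # west boundary face (i = 1): zeros per the data above
model.velocities.u[grid.Nx+1, 1, 1:grid.Nz]  # east boundary face (i = Nx+1): ones per the data above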

tomchor avatar May 21 '25 22:05 tomchor

immersed_inactive_node doesn't mask wall-normal velocity fields, so it doesn't achieve the primary objective of mask_immersed_field!, right? What am I missing there?

glwagner avatar May 21 '25 22:05 glwagner

I don't completely grasp the issue about the N+1 nodes. These are on the periphery, but not the immersed_periphery. There's some issue regarding open boundaries where we don't want to touch the non-immersed-peripheral nodes, but it's a little hazy to me, cc @jagoosw.

All this complexity points to a deficiency in the overall design, I think. I'm struggling to grasp the issues or what we want to do.

glwagner avatar May 21 '25 22:05 glwagner

I'm struggling to follow exactly what's going on here and I can't reproduce the field in this comment:

So, I'm trying to look into this, and I think the initial condition may be working against a solution. In the example above (a flat bottom in the lower 1/3 of the domain), after issuing set!(model, u=1), this is what I get back (without incompressibility being enforced):

When I do that I get:

julia> model.velocities.u[:, 1, :]
17×16 OffsetArray(::Matrix{Float64}, -3:13, -3:12) with eltype Float64 with indices -3:13×-3:12:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

That being said, I think this problem comes from initialising the field. I have noticed this before but hadn't worked out exactly what was going on; I should have opened an issue anyway.

The problem, I think, comes from the workaround we put in because the i = 1 index is stepped (but shouldn't be): the boundary value is copied to i = 0, so that point also needs to be initialised:

https://github.com/CliMA/Oceananigans.jl/blob/8a8e4aa1417958e880a037549db2185a5de8de1d/src/BoundaryConditions/perturbation_advection_open_boundary_matching_scheme.jl#L134-L144

So (for this workaround) we would have to do model.velocities.u[:, 1, :] .= U, and this works and doesn't NaN for me. (I checked, and you can do model.velocities.u[0:grid.Nx+1, 1, :] .= U, so it is the i = 0 position that is the problem.)
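
In MWE terms, the workaround described above amounts to:

# Workaround sketch: also fill the halo points (including i = 0) that set! misses.
set!(model, u=U)
model.velocities.u[:, 1, :] .= U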

jagoosw avatar May 23 '25 09:05 jagoosw