Oceananigans.jl

CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS) when using Lagrangian particles under large CFL number

Open Yixiao-Zhang opened this issue 9 months ago • 26 comments

In the example below, the model crashes with a GPU illegal memory access error. The CFL number is intentionally set to a large value, so the model is expected to become numerically unstable. I would expect the model to abort on its own when NaNs appear rather than crash with an illegal memory access error. This only happens when I use Lagrangian particles; without them, the model terminates by itself as I expect. I have also verified that the model does not crash when the CFL number is small.

using Oceananigans

const Lx = 1.0
const Nx = 50
const Δx = Lx / Nx
const max_velocity = 1.0
const cfl = 10.0
const Δt = cfl * Δx / max_velocity

function initial_u(x::R, y::R, z::R) where {R<:Real}
    return (max_velocity / Lx) * y
end

grid = RectilinearGrid(
    GPU(),
    size = (Nx, Nx, Nx),
    x = (0.0, Lx),
    y = (0.0, Lx),
    z = (0.0, Lx),
    topology = (Periodic, Bounded, Bounded)
)

arch_array = Oceananigans.Architectures.array_type(GPU()){Float64}
n_particles = 1000

xs = convert(arch_array, zeros((n_particles, )))
ys = convert(arch_array, LinRange(0.0, Lx, n_particles))
zs = convert(arch_array, zeros((n_particles, )))

particles = LagrangianParticles(x = xs, y = ys, z = zs)

model = NonhydrostaticModel(;
    grid,
    particles = particles,
)

set!(model, u = initial_u)

simulation = Simulation(model; Δt = Δt, stop_iteration = 200)

run!(simulation)
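
For reference, the constants in the script above work out as follows. This is only arithmetic on the values defined in the script (`Lx`, `Nx`, `cfl`, `max_velocity`) and the initial shear profile `initial_u`. It says nothing about how the velocity field or the particle positions evolve once the instability sets in, and it is not a diagnosis of the illegal access.

```latex
\begin{aligned}
\Delta x &= \frac{L_x}{N_x} = \frac{1.0}{50} = 0.02, \\
\Delta t &= \mathrm{cfl}\,\frac{\Delta x}{u_{\max}} = 10.0 \times \frac{0.02}{1.0} = 0.2, \\
u(y) &= \frac{u_{\max}}{L_x}\, y \in [0,\, 1.0] \quad \text{for } y \in [0,\, L_x], \\
u_{\max}\,\Delta t &= 1.0 \times 0.2 = 0.2 = 10\,\Delta x = 0.2\,L_x .
\end{aligned}
```

So even with the initial, still well-behaved velocity field, the fastest particle (at `y = Lx`) is displaced by up to 10 grid cells, or one fifth of the periodic x-extent, in a single time step.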

The output.log is uploaded as a file.

Test environment:

  • Julia version: v1.9.3
  • Oceananigans: v0.89.0
  • Tested on Ubuntu 20.04.6 LTS with CUDA 12.0 and MIT Satori with CUDA 11.4

This example tries to reproduce some of my convection simulations. In these simulations I used strong heating, so I expected some of them to crash. However, I did not expect them to trigger GPU illegal memory access errors.

This issue is probably related to #3267.

Yixiao-Zhang, Oct 06 '23 23:10