ClimaAtmos.jl
Moist Held-Suarez is qualitatively different between CPU and GPU
See the stand-alone moist Held-Suarez atmos runs here: https://buildkite.com/clima/climacoupler-longruns/builds/514
These were run with this script, which mimics the atmos driver.
Some plots (100 days, i.e. ten 10-day averages):
Zonal mean wind
Zonal mean temperature
1st level temperature
Instantaneous 100d rhoe_tot (first level)
I think the first thing we should do to narrow down this issue is make sure that we get the same result, to within machine precision, on CPU and GPU. A few places where we could see differences are those where the order of operations can differ between the two implementations, in particular reductions and DSS.
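As a reminder of why the order of operations matters here, a minimal sketch in plain Julia (not ClimaCore code): floating-point addition is not associative, so the same three numbers summed in a different order can differ in the last bit while still agreeing to within machine precision.

```julia
# Floating-point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # order a sequential loop would use
right_to_left = a + (b + c)   # a different, equally valid grouping

println(left_to_right == right_to_left)  # false: the two results are not bitwise equal
println(left_to_right ≈ right_to_left)   # true: they agree to within rounding error
```

So a CPU and a GPU reduction that group terms differently can both be correct and still not match bit for bit.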
Thanks, @charleskawczynski, this sounds like a good way forward.
@juliasloan25 has already done some machine-precision tests of atmos states, and she has seen departures after the first timestep (https://github.com/CliMA/ClimaCoupler.jl/pull/614). I don't think we've looked any deeper yet, though. Do we already have any CPU-GPU consistency tests in ClimaCore? I thought someone mentioned them recently, but I'm not sure where to look.
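For reference, a check of this kind could look roughly like the sketch below. It is not the test from the PR above: `compare_states` is a hypothetical helper, and it assumes both states have already been copied into ordinary host arrays of the same floating-point type (for a GPU array, e.g. via `Array(...)`).

```julia
# Hypothetical helper: compare two host-side copies of a model state.
# Assumes `cpu` and `gpu` are plain arrays with the same element type and shape.
function compare_states(cpu::AbstractArray{T}, gpu::AbstractArray{T}) where {T <: AbstractFloat}
    size(cpu) == size(gpu) || error("state arrays must have the same size")
    absdiff = abs.(cpu .- gpu)
    # Express each difference in units of the local floating-point spacing (ulps).
    spacing = eps.(max.(abs.(cpu), abs.(gpu)))
    return (
        bitwise = isequal(cpu, gpu),          # true only if every entry matches exactly
        max_abs = maximum(absdiff),           # largest absolute difference
        max_ulps = maximum(absdiff ./ spacing), # largest difference in ulps
    )
end

# Usage: two states that differ by one ulp in a single entry.
cpu_state = [1.0, 2.0, 3.0]
gpu_state = [1.0, nextfloat(2.0), 3.0]
result = compare_states(cpu_state, gpu_state)
println(result)
```

Reporting the difference in ulps, rather than only as a yes/no equality, makes it easier to tell roundoff-level departures from genuine discrepancies.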
Here is one place in ClimaCore where I'd be surprised to see bitwise equality:
function reduce_cuda_blocks_kernel!(
    reduce_cuda::AbstractArray{T, 2},
    op,
    ::Val{shmemsize},
) where {T, shmemsize}
    blksize = blockDim().x
    fidx = blockIdx().x
    tidx = threadIdx().x
    nitems = size(reduce_cuda, 1)
    nloads = cld(nitems, blksize) - 1
    reduction = CUDA.CuStaticSharedArray(T, shmemsize)

    reduction[tidx] = reduce_cuda[tidx, fidx]

    for i in 1:nloads
        idx = tidx + blksize * i
        if idx ≤ nitems
            reduction[tidx] = op(reduction[tidx], reduce_cuda[idx, fidx])
        end
    end
    blksize > 32 && sync_threads()
    _cuda_intrablock_reduce!(op, reduction, tidx, blksize)

    tidx == 1 && (reduce_cuda[1, fidx] = reduction[1])
    return nothing
end
That said, I'm not even sure if/where this comes into a simulation.
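To illustrate why a kernel like the one above would not be expected to match a sequential CPU sum bit for bit, here is a CPU-only sketch that mimics its accumulation order. `strided_reduce` is a hypothetical name, and the final pairwise combine is only a stand-in for `_cuda_intrablock_reduce!`, whose exact combination order is an assumption here.

```julia
# Hypothetical CPU emulation of the strided accumulation pattern in the kernel above.
function strided_reduce(op, x::AbstractVector, blksize::Integer)
    nitems = length(x)
    nlanes = min(blksize, nitems)
    nloads = cld(nitems, blksize) - 1
    # Each "thread" starts from its own element, as in reduction[tidx] = reduce_cuda[tidx, fidx].
    partial = x[1:nlanes]
    # Each "thread" then accumulates elements spaced blksize apart.
    for tidx in 1:nlanes, i in 1:nloads
        idx = tidx + blksize * i
        if idx ≤ nitems
            partial[tidx] = op(partial[tidx], x[idx])
        end
    end
    # Pairwise combine of the per-thread partials (assumed order, see lead-in).
    n = nlanes
    while n > 1
        half = cld(n, 2)
        for t in 1:(n - half)
            partial[t] = op(partial[t], partial[t + half])
        end
        n = half
    end
    return partial[1]
end

# An input chosen so that rounding makes the two orders visibly disagree.
x = [1e16, 1.0, -1e16, 1.0]
sequential = foldl(+, x)            # ((1e16 + 1) - 1e16) + 1
strided = strided_reduce(+, x, 2)   # (1e16 - 1e16) + (1 + 1)
println((sequential, strided))      # (1.0, 2.0)
```

The input is deliberately extreme so that the disagreement is obvious. With well-scaled data the two orders would typically differ only at roundoff level, which is the kind of departure a bitwise comparison would still flag.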