ClimaAtmos.jl

Moist Held-Suarez is qualitatively different between CPU and GPU

Open LenkaNovak opened this issue 10 months ago • 3 comments

See the stand-alone moist Held-Suarez atmos runs here: https://buildkite.com/clima/climacoupler-longruns/builds/514

These runs use this script, which mimics the atmos driver.

Some plots (100 days, i.e. ten 10-day averages):

Zonal mean wind: [screenshot, 2024-04-04]

Zonal mean temperature: [screenshot, 2024-04-04]

1st level temperature: [screenshot, 2024-04-04]

Instantaneous 100d rhoe_tot (first level): [screenshot, 2024-04-04]

LenkaNovak avatar Apr 05 '24 02:04 LenkaNovak

I think the first thing we should do to narrow down this issue is to make sure that we get the same result, within machine precision, on CPU and GPU. Differences could arise wherever the order of operations differs between the two implementations, in particular in reductions and DSS.
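To illustrate why the order of operations matters, here is a minimal sketch (not ClimaCore code) showing that floating-point addition is not associative, so a serial loop and a strided, multi-threaded reduction can disagree on the same data. The helpers `serial_sum` and `strided_sum` are hypothetical names introduced only for this example, and the "threads" are emulated on the CPU.

```julia
# Hypothetical helper: fold left to right, like a serial CPU loop.
serial_sum(x) = foldl(+, x)

# Hypothetical helper: emulate `nthreads` GPU threads that each accumulate a
# strided subset of `x`, then combine the per-thread partial results.
function strided_sum(x, nthreads)
    partials = [foldl(+, x[t:nthreads:end]) for t in 1:nthreads]
    return foldl(+, partials)
end

# In Float32, 1f8 + 1 rounds back to 1f8, so the grouping changes the result.
x = Float32[1f8, 1, -1f8, 1]
a = serial_sum(x)      # ((1f8 + 1) + -1f8) + 1
b = strided_sum(x, 2)  # (1f8 + -1f8) + (1 + 1)
println((a, b, a == b))
```

Here `a` is `1.0f0` and `b` is `2.0f0`. The values are deliberately extreme; in a real simulation the per-step differences would typically be at the level of rounding error, but they can grow over many timesteps.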

charleskawczynski avatar Apr 05 '24 14:04 charleskawczynski

Thanks, @charleskawczynski, this sounds like a good way forward.

@juliasloan25 has already done some machine-precision tests of atmos states, and she has seen departures after the first timestep (https://github.com/CliMA/ClimaCoupler.jl/pull/614). I don't think we've looked any deeper yet, though. Do we already have any CPU-GPU consistency tests in ClimaCore? I thought someone mentioned them recently, but I'm not sure where to look.
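As a rough sketch of what such a comparison could look like (this is not the test from the linked PR or from ClimaCore), the hypothetical helper `compare_states` below reports whether two state arrays are identical and, if not, how far apart they are relative to machine epsilon. Plain `Array`s stand in for the CPU and GPU states here; the tolerance `nulp` is an arbitrary assumption, and a device array would first need to be copied to the host.

```julia
# Hypothetical helper: summarize the difference between two state arrays.
function compare_states(cpu::AbstractArray{T}, gpu::AbstractArray{T};
                        nulp = 4) where {T <: AbstractFloat}
    abs_err = maximum(abs.(cpu .- gpu))
    scale = max(maximum(abs.(cpu)), maximum(abs.(gpu)))
    rel_err = scale == 0 ? zero(T) : abs_err / scale
    # `identical` uses elementwise ==; `within_tol` allows a few units of eps(T).
    return (; identical = cpu == gpu, abs_err, rel_err,
            within_tol = rel_err <= nulp * eps(T))
end

# Stand-in data: perturb one entry by a single unit in the last place.
cpu_state = Float32[300.0, 250.5, 1.0f-3]
gpu_state = copy(cpu_state)
gpu_state[1] = nextfloat(gpu_state[1])

r = compare_states(cpu_state, gpu_state)
println(r)
```

Running a check like this after each timestep would show when the two states first stop being identical and whether the gap is still at rounding-error level.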

LenkaNovak avatar Apr 05 '24 16:04 LenkaNovak

Here is one place in ClimaCore where I'd be surprised to see bitwise equality:

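function reduce_cuda_blocks_kernel!(
    reduce_cuda::AbstractArray{T, 2},
    op,
    ::Val{shmemsize},
) where {T, shmemsize}
    blksize = blockDim().x   # threads per block
    fidx = blockIdx().x      # block index, used as the column of reduce_cuda
    tidx = threadIdx().x     # thread index within the block
    nitems = size(reduce_cuda, 1)
    nloads = cld(nitems, blksize) - 1   # extra strided loads per thread
    reduction = CUDA.CuStaticSharedArray(T, shmemsize)   # shared-memory scratch

    # Each thread loads one element...
    reduction[tidx] = reduce_cuda[tidx, fidx]

    # ...then folds in further elements at a stride of blksize, so the order
    # in which op is applied depends on the block size.
    for i in 1:nloads
        idx = tidx + blksize * i
        if idx ≤ nitems
            reduction[tidx] = op(reduction[tidx], reduce_cuda[idx, fidx])
        end
    end

    # Synchronize when the block has more than 32 threads (one warp).
    blksize > 32 && sync_threads()
    _cuda_intrablock_reduce!(op, reduction, tidx, blksize)

    # Thread 1 writes the block's result back to the first row.
    tidx == 1 && (reduce_cuda[1, fidx] = reduction[1])
    return nothing
end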

That said, I'm not even sure if/where this comes into a simulation.

charleskawczynski avatar Apr 05 '24 19:04 charleskawczynski