
Use KernelAbstractions to accelerate `MultilayerQG.streamfunctionfrompv!`

glwagner opened this issue on Sep 15 '20 · 23 comments

KernelAbstractions.jl can be used to accelerate the function

https://github.com/FourierFlows/GeophysicalFlows.jl/blob/47a2b51def8af38b727dd64542c87051f169d5ea/src/multilayerqg.jl#L299-L302

A simple example showing how to use KernelAbstractions is the "Naive Transpose":

https://juliagpu.gitlab.io/KernelAbstractions.jl/examples/naive_transpose/
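For reference, the kernel at the heart of that example is only a few lines. A paraphrase (not the exact code from the docs) showing the basic @kernel / @index pattern:

using KernelAbstractions

@kernel function naive_transpose!(b, a)
    i, j = @index(Global, NTuple)
    @inbounds b[i, j] = a[j, i]
end

The same pattern, a small @kernel body launched over an (i, j) index space, is what streamfunctionfrompv! needs.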

glwagner avatar Sep 15 '20 12:09 glwagner

The first step is to write a kernel, which will look something like

using KernelAbstractions

@kernel function invert_column!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    @inbounds ψh[i, j] = S⁻¹[i, j] * qh[i, j]
end

The next step is to create a work layout over which the kernel is launched. If we restrict attention to models that always have at least 16 grid points in each horizontal direction, we can use something like

# Larger workgroups are generally more efficient. For more generality, we could add an if statement
# that selects a different workgroup when either nkr or nl is less than 16
workgroup = 16, 16

# The worksize determines how many times the kernel body is executed
worksize = grid.nkr, grid.nl

# This (and its usage below) will ensure the kernel is not run _before_ the data in qh is available
barrier = Event(dev)

# Creates a loop over the specified worksize, using workgroup to organize the computation
loop_invert_column! = invert_column!(dev, workgroup, worksize)

# Launch the kernel
event = loop_invert_column!(ψh, qh, params.invS, dependencies=barrier)

# This will ensure that no other operations occur until the kernel has finished
wait(dev, event)
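For concreteness, here is a minimal end-to-end sketch that runs on the CPU. It assumes ψh and qh are stored as 2D arrays of SVector columns (which, as discussed below, is not the current layout), uses made-up sizes, and follows the Event-based launch API shown above:

using KernelAbstractions, StaticArrays

@kernel function invert_column!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    @inbounds ψh[i, j] = S⁻¹[i, j] * qh[i, j]
end

dev = CPU()
nkr, nl, nz = 16, 16, 2  # illustrative sizes

# One nz × nz inverse matrix and one length-nz column per horizontal wavenumber
S⁻¹ = [rand(SMatrix{nz, nz, ComplexF64}) for i in 1:nkr, j in 1:nl]
qh  = [rand(SVector{nz, ComplexF64}) for i in 1:nkr, j in 1:nl]
ψh  = similar(qh)

loop_invert_column! = invert_column!(dev, (16, 16), (nkr, nl))
event = loop_invert_column!(ψh, qh, S⁻¹, dependencies=Event(dev))
wait(dev, event)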

glwagner avatar Sep 15 '20 12:09 glwagner

One thing I am not totally sure about is whether KernelAbstractions will compile away the matrix multiplication in @inbounds ψh[i, j] = S⁻¹[i, j] * qh[i, j]. I think it will. If not, we may have to unroll the loop by hand.
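One way to sanity-check this on the CPU (an illustrative check, not part of the proposed code): with StaticArrays, a fully unrolled matrix-vector product shows up as a nanosecond-scale, allocation-free benchmark.

using StaticArrays, BenchmarkTools

A = @SMatrix rand(2, 2)
x = @SVector rand(2)

# If the multiply compiles away, this reports a few nanoseconds and 0 allocations
@btime $A * $x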

glwagner avatar Sep 15 '20 12:09 glwagner

By the way, I think this optimization also requires each column ψh[i, j] to be stored as a StaticArray, but ψh is a 3D array right now. Other parts of the code may also have to be converted to kernels if this change is made, since broadcasting over the 3D array would no longer work.
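To make the layout change concrete, here is one (copying) way to repack the current 3D array into a 2D array of static columns; the names and sizes are invented for illustration:

using StaticArrays

nkr, nl, nz = 8, 8, 2                 # illustrative sizes
ψh3d = rand(ComplexF64, nkr, nl, nz)  # current 3D layout

# New layout: one static column per horizontal wavenumber (i, j)
ψh_columns = [SVector{nz}(ψh3d[i, j, :]) for i in 1:nkr, j in 1:nl]

The catch is what this layout does to the FFTs, which comes up next.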

glwagner avatar Sep 15 '20 12:09 glwagner

With this last suggestion, would the x, y FFTs work nicely?

navidcy avatar Sep 15 '20 19:09 navidcy

With this last suggestion, would the x, y FFTs work nicely?

Oof, good point.

Hmm, maybe we need to hand-write the matrix-vector multiply then. Not sure.

glwagner avatar Sep 15 '20 20:09 glwagner

Yes, it keeps coming back to haunt us either way... (I remember a similar discussion some months ago...)

navidcy avatar Sep 15 '20 20:09 navidcy

Something like

@kernel function invert_column!(ψh, qh, S⁻¹)
    i, j = @index(Global, NTuple)
    ψh_column = view(ψh, i, j, :)
    qh_column = view(qh, i, j, :)
    @inbounds ψh_column .= S⁻¹[i, j] * qh_column
end

might work.

glwagner avatar Sep 16 '20 12:09 glwagner

Otherwise a kernel along the lines of

using KernelAbstractions.Extras.LoopInfo: @unroll

# A hand-rolled matrix-vector multiply down each column
@kernel function invert_column!(ψh, qh, S⁻¹, nz)
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nz

        @inbounds ψh[i, j, k] = 0

        @unroll for m = 1:nz
            @inbounds ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
        end

    end
end

might work, alternatively. Or maybe my indices are screwed up --- whichever is correct.

Nothing is too difficult, it's just a matter of trying it out.
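For what it's worth, a launch sketch for this unrolled version, with invert_column! as defined just above, illustrative names and sizes, and the same Event-based API as before. One appealing property: ψh and qh stay plain 3D arrays, so the x, y FFTs are untouched and only S⁻¹ needs per-wavenumber static matrices.

using KernelAbstractions, StaticArrays

nkr, nl, nz = 16, 16, 2  # illustrative sizes
S⁻¹ = [rand(SMatrix{nz, nz, ComplexF64}) for i in 1:nkr, j in 1:nl]
qh  = rand(ComplexF64, nkr, nl, nz)  # 3D layout unchanged
ψh  = similar(qh)

dev = CPU()
loop_invert_column! = invert_column!(dev, (16, 16), (nkr, nl))
event = loop_invert_column!(ψh, qh, S⁻¹, nz, dependencies=Event(dev))
wait(dev, event)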

glwagner avatar Sep 16 '20 12:09 glwagner

I should resurrect this...

navidcy avatar Dec 02 '20 20:12 navidcy

What about https://github.com/mcabbott/Tullio.jl to the rescue? (just a random thought)
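For instance, if S⁻¹ were stored as a plain 4D array rather than a matrix of static matrices, the whole inversion might collapse to a single einsum-style line (a sketch only; the index order is a guess):

using Tullio

# Tullio sums over the repeated index m automatically, and can generate GPU
# kernels through KernelAbstractions when GPU arrays are passed in
@tullio ψh[i, j, k] = S⁻¹[k, m, i, j] * qh[i, j, m]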

navidcy avatar Mar 18 '21 05:03 navidcy

There are probably a lot of solutions! I think I gave two above, but there might be more.

glwagner avatar Mar 18 '21 14:03 glwagner