
GPU support?

Open findmyway opened this issue 3 years ago • 9 comments

Hi @mfiano ,

Just found this package from the new-packages-feed channel on Slack. Really nice one!

Several days back, I created a Julia wrapper around FastNoise2: https://github.com/JuliaReinforcementLearning/FastNoise2.jl. I think both packages share a lot of utilities, but one thing missing from that wrapper is native GPU support. Looking into the source code of CoherentNoise.jl, I'm wondering how much extra work it would take to add GPU support for some common noise types like OpenSimplex2?

findmyway avatar Aug 17 '22 15:08 findmyway

It would take a bit of work, for at least two reasons:

  1. We'd have to restructure the library to push large arrays to the GPU; it currently operates on a per-sample basis (a rough sketch of this approach follows at the end of this comment).

  2. Alternatively, we could re-implement everything as (probably) compute shaders. The problem here is that not all GPU vendors produce the same output for many mathematical operations, so the reproducibility aspect would probably be lost.

That said, if a good enough solution comes up, I would like this functionality in a future release. I'm not very well versed in the Julia GPU computing landscape yet, so if anyone wants to try working on this, that would be great. For now, I'm going to leave this as a "nice to have for the future" idea.
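
To illustrate what I mean by the first option (just a sketch, not the package's API; `noise` here is a stand-in for a real per-sample function):

using CUDA

# A pure per-sample function broadcast over coordinate arrays works unchanged
# on CPU Arrays and on CuArrays; `noise` is a placeholder, not a
# CoherentNoise.jl function.
noise(x, y) = sin(x) * cos(y)

xs = collect(range(0.0f0, 10.0f0; length=1024))
ys = collect(range(0.0f0, 10.0f0; length=1024))

img_cpu = noise.(xs, ys')          # 1024×1024 image on the CPU
img_gpu = noise.(cu(xs), cu(ys)')  # same broadcast, now on the GPU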

kanubacode avatar Aug 17 '22 16:08 kanubacode

A quick test with OpenSimplex2 shows that, with minor modifications, the code can work on both the CPU and the GPU.

Hopefully this is useful for someone who'd like to work on it in the future.

using CUDA

const HASH_MULTIPLIER = 0x53a3f72deec546f5
const PRIME_X = 0x5205402b9270c86f
const PRIME_Y = 0x598cd327003817b5

const OS2_SKEW_2D = 0.366025403784439f0
const OS2_UNSKEW_2D = -0.21132486540518713f0
const OS2_R²2D = 0.5f0
const OS2_NUM_GRADIENTS_EXP_2D = 7
const OS2_NUM_GRADIENTS_2D = 1 << OS2_NUM_GRADIENTS_EXP_2D
const OS2_GRADIENTS_NORMALIZED_2D = Float32[
    0.38268343236509, 0.923879532511287,
    0.923879532511287, 0.38268343236509,
    0.923879532511287, -0.38268343236509,
    0.38268343236509, -0.923879532511287,
    -0.38268343236509, -0.923879532511287,
    -0.923879532511287, -0.38268343236509,
    -0.923879532511287, 0.38268343236509,
    -0.38268343236509, 0.923879532511287,
    0.130526192220052, 0.99144486137381,
    0.608761429008721, 0.793353340291235,
    0.793353340291235, 0.608761429008721,
    0.99144486137381, 0.130526192220051,
    0.99144486137381, -0.130526192220051,
    0.793353340291235, -0.60876142900872,
    0.608761429008721, -0.793353340291235,
    0.130526192220052, -0.99144486137381,
    -0.130526192220052, -0.99144486137381,
    -0.608761429008721, -0.793353340291235,
    -0.793353340291235, -0.608761429008721,
    -0.99144486137381, -0.130526192220052,
    -0.99144486137381, 0.130526192220051,
    -0.793353340291235, 0.608761429008721,
    -0.608761429008721, 0.793353340291235,
    -0.130526192220052, 0.99144486137381]
const OS2_GRADIENTS_2D = OS2_GRADIENTS_NORMALIZED_2D ./ 0.01001634121365712f0

"""
    opensimplex2_2d(; kwargs...)
Construct a sampler that outputs 2-dimensional OpenSimplex2 noise when it is sampled from.
# Arguments
  - `seed=0`: An integer used to seed the random number generator for this sampler.
  - `orient=nothing`: Either the symbol `:x` or the value `nothing`:
      + `:x`: The noise space will be re-oriented with the Y axis pointing down the main diagonal to
        improve visual isotropy.
      + `nothing`: Use the standard orientation.
"""
# NOTE: `opensimplex2` is the constructor defined in CoherentNoise.jl; it is
# not included in this standalone snippet, so this method is unused in the
# demo below.
opensimplex2_2d(; seed=0, orient=nothing) = opensimplex2(2, seed, orient)

# Hash the lattice coordinates into an even index into the flat gradient table
# and return the dot product of the selected gradient with (x, y).
@inline function grad(table, seed, X, Y, x, y)
    N = length(table)
    hash = (seed ⊻ X ⊻ Y) * HASH_MULTIPLIER
    hash ⊻= hash >> (64 - OS2_NUM_GRADIENTS_EXP_2D + 1)
    i = hash & ((OS2_NUM_GRADIENTS_2D - 1) << 1)
    t = (table[mod1(i + 1, N)], table[mod1((i | 1) + 1, N)])
    sum(t .* (x, y))
end

# @inline transform(::OpenSimplex2{2,OrientStandard}, x, y) = (x, y) .+ OS2_SKEW_2D .* (x + y)
@inline transform(x, y) = (x, y) .+ OS2_SKEW_2D .* (x + y)

# @inline function transform(::OpenSimplex2{2,OrientX}, x, y)
#     xx = x * ROOT_2_OVER_2
#     yy = y * ROOT_2_OVER_2 * (2OS2_SKEW_2D + 1)
#     (yy + xx, yy - xx)
# end

# Standalone port of the 2D OpenSimplex2 sampler; the gradient table is passed
# in as an argument so the same code runs on CPU arrays and GPU device arrays.
function sample(x::T, y::T; seed=123, os2_gradients_2d=OS2_GRADIENTS_2D) where {T<:Real}
    primes = (PRIME_X, PRIME_Y)
    tr = transform(x, y)
    XY = floor.(Int, tr)
    vtr = tr .- XY
    t = sum(vtr) * OS2_UNSKEW_2D
    X1, Y1 = XY .* primes
    X2, Y2 = (X1, Y1) .+ primes
    x1, y1 = vtr .+ t
    us1 = 2OS2_UNSKEW_2D + 1
    result = 0.0f0
    a1 = OS2_R²2D - x1^2 - y1^2
    if a1 > 0
        result += a1^4 * grad(os2_gradients_2d, seed, X1, Y1, x1, y1)
    end
    a2 = 2us1 * (1 / OS2_UNSKEW_2D + 2) * t + -2us1^2 + a1
    if a2 > 0
        x, y = (x1, y1) .- 2OS2_UNSKEW_2D .- 1
        result += a2^4 * grad(os2_gradients_2d, seed, X2, Y2, x, y)
    end
    if y1 > x1
        x = x1 - OS2_UNSKEW_2D
        y = y1 - OS2_UNSKEW_2D - 1
        a3 = OS2_R²2D - x^2 - y^2
        if a3 > 0
            result += a3^4 * grad(os2_gradients_2d, seed, X1, Y2, x, y)
        end
    else
        x = x1 - OS2_UNSKEW_2D - 1
        y = y1 - OS2_UNSKEW_2D
        a4 = OS2_R²2D - x^2 - y^2
        if a4 > 0
            result += a4^4 * grad(os2_gradients_2d, seed, X2, Y1, x, y)
        end
    end
    result
end

# GPU kernel: each thread fills elements of `xs` in a grid-stride loop.
function gpu_os!(xs, freq, start, seed, grads)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(xs)
        pos = Tuple(CartesianIndices(xs)[i] - CartesianIndex(1, 1))
        @inbounds xs[i] = sample((pos .* freq .+ start)...; seed=seed, os2_gradients_2d=grads)
    end
    return nothing
end

# Compile the kernel without launching, query an occupancy-based launch
# configuration, then run the kernel.
function bench_gpu_os!(xs, grads)
    kernel = @cuda launch = false gpu_os!(xs, 1, (0, 0), 123, grads)
    config = launch_configuration(kernel.fun)
    threads = min(length(xs), config.threads)
    blocks = cld(length(xs), threads)
    CUDA.@sync begin
        kernel(xs, 1, (0, 0), 123, grads; threads, blocks)
    end
end

using BenchmarkTools

cxs = cu(fill(-1.0f0, 1024, 1024))
grads = cu(OS2_GRADIENTS_2D);

@btime bench_gpu_os!($cxs, $grads)
# 328.587 μs (73 allocations: 4.14 KiB)

xs = fill(0.0f0, 1024, 1024)

@btime for i in CartesianIndices($xs)
    xs[i] = sample(Tuple(i)...)  # `xs` is an untyped global inside the loop body, hence the allocations
end
# 52.659 ms (2097152 allocations: 48.00 MiB)
@btime for i in CartesianIndices($xs)
    $xs[i] = sample(Tuple(i)...)
end
# 30.538 ms (0 allocations: 0 bytes)

findmyway avatar Aug 20 '22 10:08 findmyway

Interesting. Could you post the GPU benchmarks next to the CPU timings, and maybe for 2-4D?

kanubacode avatar Aug 20 '22 11:08 kanubacode

| Device | OpenSimplex2_2D (1024, 1024) | OpenSimplex2_3D (1024, 32, 32) | 2D in this package | 3D in this package |
|--------|------------------------------|--------------------------------|--------------------|--------------------|
| CPU    | 30.538 ms                    | 37.157 ms                      | 25.356 ms          | 31.606 ms          |
| GPU    | 328.587 μs                   | 436.426 μs                     | -                  | -                  |

I guess the difference on the CPU is because I disabled the @fastpow macro?

findmyway avatar Aug 20 '22 11:08 findmyway

Oh yes, that will make a big difference.

I'm also curious what the results of two same-seeded gen_image calls look like.

This is pretty exciting work though.

kanubacode avatar Aug 20 '22 12:08 kanubacode

The difference is minor.

julia> mean(abs.(Array(cxs) .- xs))
5.706289f-5

julia> isapprox(Array(cxs), xs)
true

Here are the two images of OpenSimplex2_2D with seed 123:

The one from this package:

[image: coheren_noise_opensimplex2_2d_seed_123]

The one generated from the GPU version:

[image: gpu_opensimplex2_2d_seed_123]

findmyway avatar Aug 20 '22 13:08 findmyway

I just found that @fastpow works at the Expr level. After adding it back, the performance of the modified version on the CPU is now on par.
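
For context, assuming the @fastpow here is FastPow.jl's macro, it rewrites literal integer powers into repeated multiplications at macro-expansion time, e.g.:

using FastPow

# Rewrites the literal power into multiplications before compilation instead
# of leaving a call to the generic power function.
f(a) = @fastpow a^4  # roughly ((a * a) * (a * a))

f(2.0f0)  # 16.0f0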

For 2D: 274.633 μs on the GPU and 25.838 ms on the CPU with the modified version.

findmyway avatar Aug 20 '22 14:08 findmyway

OK, this all looks very neat, but I'm afraid I don't understand the CUDA code much at all, and one thing concerns me:

In practical usage, one will create a pipeline of many samplers fused together through the use of FractalSampler and/or ModifierSampler. I think the GPU parallelization here is being done at the wrong level: we need a top-level CUDA function that wraps an arbitrary sampler. Otherwise we will be parallelizing and syncing between arbitrarily large samplers within a pipeline, which will hurt performance.

I'm not quite sure what such a solution would look like, but maybe you have an idea, being familiar with the CUDA API. If this is something you want to try tackling, a PR would be much appreciated, where we can discuss this further so that proper git diffs can be seen, etc.
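
Conceptually (and I may be off here, since CUDA is not my area), I imagine a single kernel that maps a whole callable pipeline over the output array, something like this untested sketch adapted from your gpu_os! (the callable-pipeline assumption and the apply_pipeline! name are hypothetical):

using CUDA

# Hypothetical: `pipeline` is any GPU-compatible callable (x, y) -> value,
# e.g. an entire fused fractal/modifier pipeline, so one kernel launch covers
# the whole image instead of one launch (and sync) per sampler.
function apply_pipeline!(out, pipeline, freq, start)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(out)
        pos = Tuple(CartesianIndices(out)[i] - CartesianIndex(1, 1))
        @inbounds out[i] = pipeline((pos .* freq .+ start)...)
    end
    return nothing
end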

Thank you for looking into this though. It looks very promising.

kanubacode avatar Aug 21 '22 07:08 kanubacode

One possible solution is to move the global constants into the struct of each sampler. Take OpenSimplex2, for example:

struct OpenSimplex2
    seed
    grads
end

The grads field can point to shared memory on either the CPU or the GPU. This way, different samplers can be fused together and used in arbitrary kernel functions.
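
A rough, untested sketch of what I have in mind (the names are illustrative; Adapt.@adapt_structure is what lets CUDA.jl convert the inner CuArray when the struct is passed to a kernel):

using CUDA, Adapt

# A sampler that carries its own state instead of relying on globals.
struct OpenSimplex2Sampler{S,G}
    seed::S
    grads::G
end

# Recursively adapt the fields, so a sampler holding a CuArray becomes a
# sampler holding a device array inside a kernel.
Adapt.@adapt_structure OpenSimplex2Sampler

# Making it callable lets pipelines compose samplers like plain functions.
(s::OpenSimplex2Sampler)(x, y) = sample(x, y; seed=s.seed, os2_gradients_2d=s.grads)

cpu_sampler = OpenSimplex2Sampler(123, OS2_GRADIENTS_2D)      # CPU
gpu_sampler = OpenSimplex2Sampler(123, cu(OS2_GRADIENTS_2D))  # GPU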


I'm currently working on an internal project which might try this approach. If everything works fine, I'll definitely come back and make a PR.

findmyway avatar Aug 21 '22 08:08 findmyway