CoherentNoise.jl
GPU support?
Hi @mfiano ,
Just found this package from the new-packages-feed channel on Slack. Really nice one!
Several days back, I created a Julia wrapper: https://github.com/JuliaReinforcementLearning/FastNoise2.jl . I think both packages share a lot of utilities, but one thing missing from that wrapper is native GPU support. Looking into the source code of CoherentNoise.jl, I'm wondering how much extra work it would take to add GPU support for some common noise types like OpenSimplex2?
It would take a bit of work, if only for two reasons:
- We'd have to restructure the library to push large arrays to the GPU; it currently operates on a per-sample basis.
- Alternatively, we could re-implement everything as (probably) compute shaders. The problem there is that not all GPU vendors produce the same output for many mathematical operations, so the reproducibility aspect would probably be lost.
That said, if a good enough solution comes up, I would like this functionality in a future release. I fear I'm not as well versed in the Julia GPU computing landscape yet, so if anyone wants to try working on this, that would be great. For now, I'm going to leave this as a "nice to have for the future" idea.
A quick test with OpenSimplex2 shows that, with minor modifications, the code can work on both CPU and GPU.
Hoping this is useful for someone who'd like to work on it in the future.
```julia
using CUDA
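# The constants below are taken from the package's OpenSimplex2 implementation:
# hashing primes, the 2D skew/unskew factors, and the gradient table size.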
const HASH_MULTIPLIER = 0x53a3f72deec546f5
const PRIME_X = 0x5205402b9270c86f
const PRIME_Y = 0x598cd327003817b5
const OS2_SKEW_2D = 0.366025403784439f0
const OS2_UNSKEW_2D = -0.21132486540518713f0
const OS2_R²2D = 0.5f0
const OS2_NUM_GRADIENTS_EXP_2D = 7
const OS2_NUM_GRADIENTS_2D = 1 << OS2_NUM_GRADIENTS_EXP_2D
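# Flattened (x, y) pairs of the 2D gradient directions; the division below rescales
# them for the final output range.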
const OS2_GRADIENTS_NORMALIZED_2D = Float32[
0.38268343236509, 0.923879532511287,
0.923879532511287, 0.38268343236509,
0.923879532511287, -0.38268343236509,
0.38268343236509, -0.923879532511287,
-0.38268343236509, -0.923879532511287,
-0.923879532511287, -0.38268343236509,
-0.923879532511287, 0.38268343236509,
-0.38268343236509, 0.923879532511287,
0.130526192220052, 0.99144486137381,
0.608761429008721, 0.793353340291235,
0.793353340291235, 0.608761429008721,
0.99144486137381, 0.130526192220051,
0.99144486137381, -0.130526192220051,
0.793353340291235, -0.60876142900872,
0.608761429008721, -0.793353340291235,
0.130526192220052, -0.99144486137381,
-0.130526192220052, -0.99144486137381,
-0.608761429008721, -0.793353340291235,
-0.793353340291235, -0.608761429008721,
-0.99144486137381, -0.130526192220052,
-0.99144486137381, 0.130526192220051,
-0.793353340291235, 0.608761429008721,
-0.608761429008721, 0.793353340291235,
-0.130526192220052, 0.99144486137381]
const OS2_GRADIENTS_2D = OS2_GRADIENTS_NORMALIZED_2D ./ 0.01001634121365712f0
"""
opensimplex2_2d(; kwargs...)
Construct a sampler that outputs 2-dimensional OpenSimplex2 noise when it is sampled from.
# Arguments
- `seed=0`: An integer used to seed the random number generator for this sampler.
- `orient=nothing`: Either the symbol `:x` or the value `nothing`:
+ `:x`: The noise space will be re-oriented with the Y axis pointing down the main diagonal to
improve visual isotropy.
+ `nothing`: Use the standard orientation.
"""
opensimplex2_2d(; seed=0, orient=nothing) = opensimplex2(2, seed, orient)
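# Hash the seed with the two lattice coordinates, select a gradient pair from `table`,
# and return its dot product with the offset (x, y).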
@inline function grad(table, seed, X, Y, x, y)
    N = length(table)
    hash = (seed ⊻ X ⊻ Y) * HASH_MULTIPLIER
    hash ⊻= hash >> (64 - OS2_NUM_GRADIENTS_EXP_2D + 1)
    i = trunc(hash) & ((OS2_NUM_GRADIENTS_2D - 1) << 1)
    t = (table[mod1(i + 1, N)], table[mod1((i | 1) + 1, N)])
    sum((t .* (x, y)))
end
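# Skew the input point onto the simplex lattice (standard orientation only; the
# package's dispatching variants are left commented out below).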
# @inline transform(::OpenSimplex2{2,OrientStandard}, x, y) = (x, y) .+ OS2_SKEW_2D .* (x + y)
@inline transform(x, y) = (x, y) .+ OS2_SKEW_2D .* (x + y)
# @inline function transform(::OpenSimplex2{2,OrientX}, x, y)
# xx = x * ROOT_2_OVER_2
# yy = y * ROOT_2_OVER_2 * (2OS2_SKEW_2D + 1)
# (yy + xx, yy - xx)
# end
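# Single-point 2D OpenSimplex2 evaluation, adapted from the package. The gradient
# table is passed as an argument so that a GPU array can be supplied from inside a kernel.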
function sample(x::T, y::T; seed=123, os2_gradients_2d=OS2_GRADIENTS_2D) where {T<:Real}
    primes = (PRIME_X, PRIME_Y)
    tr = transform(x, y)
    XY = floor.(Int, tr)
    vtr = tr .- XY
    t = sum(vtr) * OS2_UNSKEW_2D
    X1, Y1 = XY .* primes
    X2, Y2 = (X1, Y1) .+ primes
    x1, y1 = vtr .+ t
    us1 = 2OS2_UNSKEW_2D + 1
    result = 0.0f0
    a1 = OS2_R²2D - x1^2 - y1^2
    if a1 > 0
        result += a1^4 * grad(os2_gradients_2d, seed, X1, Y1, x1, y1)
    end
    a2 = 2us1 * (1 / OS2_UNSKEW_2D + 2) * t + -2us1^2 + a1
    if a2 > 0
        x, y = (x1, y1) .- 2OS2_UNSKEW_2D .- 1
        result += a2^4 * grad(os2_gradients_2d, seed, X2, Y2, x, y)
    end
    if y1 > x1
        x = x1 - OS2_UNSKEW_2D
        y = y1 - OS2_UNSKEW_2D - 1
        a3 = OS2_R²2D - x^2 - y^2
        if a3 > 0
            result += a3^4 * grad(os2_gradients_2d, seed, X1, Y2, x, y)
        end
    else
        x = x1 - OS2_UNSKEW_2D - 1
        y = y1 - OS2_UNSKEW_2D
        a4 = OS2_R²2D - x^2 - y^2
        if a4 > 0
            result += a4^4 * grad(os2_gradients_2d, seed, X2, Y1, x, y)
        end
    end
    result
end
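# CUDA kernel: each thread walks a grid-stride loop, converts its linear index to
# 2D grid coordinates, and writes one noise sample per element of `xs`.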
function gpu_os!(xs, freq, start, seed, grads)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(xs)
        pos = Tuple(CartesianIndices(xs)[i] - CartesianIndex(1, 1))
        @inbounds xs[i] = sample((pos .* freq .+ start)...; seed=seed, os2_gradients_2d=grads)
    end
    return nothing
end
function bench_gpu_os!(xs, grads)
    kernel = @cuda launch=false gpu_os!(xs, 1, (0, 0), 123, grads)
    config = launch_configuration(kernel.fun)
    threads = min(length(xs), config.threads)
    blocks = cld(length(xs), threads)
    CUDA.@sync begin
        kernel(xs, 1, (0, 0), 123, grads; threads, blocks)
    end
end
using BenchmarkTools
cxs = cu(fill(-1.0f0, 1024, 1024))
grads = cu(OS2_GRADIENTS_2D);
@btime bench_gpu_os!($cxs, $grads)
# 328.587 μs (73 allocations: 4.14 KiB)
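# CPU reference: the same per-sample function in a plain loop. The first timing reads
# the non-interpolated global `xs` and therefore allocates; the second interpolates `xs`.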
xs = fill(0.0f0, 1024, 1024)
@btime for i in CartesianIndices($xs)
    xs[i] = sample(Tuple(i)...)
end
# 52.659 ms (2097152 allocations: 48.00 MiB)
@btime for i in CartesianIndices($xs)
    $xs[i] = sample(Tuple(i)...)
end
# 30.538 ms (0 allocations: 0 bytes)
```
Interesting. Could you post the benchmark next to the CPU timings, and maybe for 2-4D?
| Device | OpenSimplex2_2D (1024,1024) | OpenSimplex2_3D (1024,32,32) | 2D in this package | 3D in this package |
|---|---|---|---|---|
| CPU | 30.538 ms | 37.157 ms | 25.356 ms | 31.606 ms |
| GPU | 328.587 μs | 436.426 μs | - | - |
I guess the difference on the CPU is because I disabled the @fastpow macro?
Oh yes, that will make a big difference.
I'm also curious what the results of two same-seeded gen_image calls look like.
This is pretty exciting work though.
The difference is minor.
```julia
julia> mean(abs.(Array(cxs) .- xs))
5.706289f-5
julia> isapprox(Array(cxs), xs)
true
```
Here are the two images of OpenSimplex2_2D with seed 123:
The one from this package:

The one generated from the GPU version:

I just found that @fastpow works at the Expr level. After adding it back, the performance of the modified version on the CPU is on par now.
For 2D, it's 274.633 μs on the GPU and 25.838 ms on the CPU with the modified version.
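For anyone curious, here is a toy illustration of what that means (assuming the `@fastpow` in question is FastPow.jl's macro): it rewrites literal integer powers inside the expression it receives into repeated multiplications at macro-expansion time, which is why it has to be applied to the source expression rather than to an already-compiled function.

```julia
using FastPow

# Toy example (not from the package): the a^4 falloff terms above are the kind
# of expression @fastpow rewrites into plain multiplications.
plain(a) = a^4
fast(a) = @fastpow a^4

plain(0.37f0) ≈ fast(0.37f0)  # same value up to floating-point rounding
```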
OK, this all looks very neat, but I'm afraid I don't understand the CUDA code much at all, and one thing concerns me:
In practical usage, one will create a pipeline of many samplers fused together through the use of FractalSampler and/or ModifierSampler. I think the GPU parallelization here is being done at the wrong level; we need a top-level CUDA function to wrap an arbitrary sampler in. Otherwise we will be parallelizing and syncing across arbitrarily large samplers within a pipeline, which will hurt performance.
I'm not quite sure what such a solution would look like, but maybe you have an idea, being familiar with the CUDA API. If this is something you want to try tackling, a PR would be much appreciated; there we can discuss this further with proper git diffs to look at, etc.
Thank you for looking into this though. It looks very promising.
One possible solution is to move the global constants into the struct of each sampler. Take OpenSimplex2 for example:
```julia
struct OpenSimplex2
    seed
    grads
end
```
The grads field can point to a shared table living in either CPU or GPU memory. In this way, different samplers can be fused together and used in arbitrary kernel functions.
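To make that concrete, here is a rough sketch (the names `OpenSimplex2Sampler`, `sample_at`, and `sampler_kernel!` are hypothetical, not the package's current API), reusing the `sample` function posted above. The `Adapt.@adapt_structure` line is what lets CUDA.jl convert the `grads` field into a device array when the struct is passed to a kernel:

```julia
using CUDA, Adapt

# Hypothetical sampler struct carrying its own seed and gradient table.
struct OpenSimplex2Sampler{G}
    seed::UInt64
    grads::G   # Vector{Float32} on the CPU, CuArray (CuDeviceArray inside a kernel) on the GPU
end

# Adapt each field when the struct crosses the host/device boundary.
Adapt.@adapt_structure OpenSimplex2Sampler

# Per-point evaluation: forward to the `sample` function defined earlier in this thread.
@inline sample_at(s::OpenSimplex2Sampler, x, y) =
    sample(x, y; seed=s.seed, os2_gradients_2d=s.grads)

# A single top-level kernel that works for any sampler implementing `sample_at`,
# so a fused pipeline would be parallelized once at the outermost level.
function sampler_kernel!(out, s, freq, start)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    while i <= length(out)
        pos = Tuple(CartesianIndices(out)[i] - CartesianIndex(1, 1))
        @inbounds out[i] = sample_at(s, (pos .* freq .+ start)...)
        i += stride
    end
    return nothing
end

# Usage:
s = OpenSimplex2Sampler(UInt64(123), cu(OS2_GRADIENTS_2D))
out = CUDA.zeros(Float32, 1024, 1024)
@cuda threads=256 blocks=cld(length(out), 256) sampler_kernel!(out, s, 1, (0, 0))
```

Once adapted, the sampler is a plain isbits struct, so fractal/modifier samplers that wrap other samplers could in principle be composed the same way and still go through this one top-level kernel.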
I'm currently working on an internal project which might try this approach. If everything works fine, I'll definitely come back and make a PR.