# FoldsCUDA.jl
Data-parallelism on CUDA using Transducers.jl and for loops (FLoops.jl)
FoldsCUDA.jl provides a Transducers.jl-compatible fold (reduce) implemented using CUDA.jl. This brings the transducers and reducing function combinators implemented in Transducers.jl to the GPU. Furthermore, using FLoops.jl, you can write parallel for loops that run on the GPU.
## API
FoldsCUDA exports `CUDAEx`, a parallel loop executor. It can be used with parallel for loops created with `FLoops.@floop`, with the Base-like high-level parallel API in Folds.jl, and with the extensible transducers provided by Transducers.jl.
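For instance, a minimal sketch of using the executor with the Folds.jl high-level API (the array size and reductions here are illustrative, not from the original docs; Folds.jl functions take the executor as their last argument):

```julia
using FoldsCUDA, Folds, CUDA

xs = CUDA.rand(10^7)

# Parallel sum on the GPU.
Folds.sum(xs, CUDAEx())

# Map-reduce on the GPU: maximum of the squares.
Folds.maximum(abs2, xs, CUDAEx())
```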
## Examples
### findmax using FLoops.jl
You can pass the CUDA executor `FoldsCUDA.CUDAEx()` to `@floop` to run a parallel for loop on the GPU:
```julia
julia> using FoldsCUDA, CUDA, FLoops

julia> using GPUArrays: @allowscalar

julia> xs = CUDA.rand(10^8);

julia> @allowscalar xs[100] = 2;

julia> @allowscalar xs[200] = 2;

julia> @floop CUDAEx() for (x, i) in zip(xs, eachindex(xs))
           @reduce() do (imax = -1; i), (xmax = -Inf32; x)
               if xmax < x
                   xmax = x
                   imax = i
               end
           end
       end

julia> xmax
2.0f0

julia> imax  # the *first* position for the largest value
100
```
### extrema using Transducers.TeeRF
```julia
julia> using Transducers, Folds

julia> @allowscalar xs[300] = -0.5;

julia> Folds.reduce(TeeRF(min, max), xs, CUDAEx())
(-0.5f0, 2.0f0)

julia> Folds.reduce(TeeRF(min, max), (2x for x in xs), CUDAEx())  # iterator comprehension works
(-1.0f0, 4.0f0)

julia> Folds.reduce(TeeRF(min, max), Map(x -> 2x)(xs), CUDAEx())  # equivalent, using a transducer
(-1.0f0, 4.0f0)
```
### More examples
For more examples, see the examples section in the documentation.