LazyArrays.jl

Lazy Arrays with CUDA.jl (feature request)

Open jenkspt opened this issue 2 years ago • 5 comments

This is a feature request to get LazyArrays to work with CuArrays (and potentially other GPUArray implementations)

Some of the features of LazyArrays work with CuArrays out of the box, which is awesome. For example:

using CUDA
using LazyArrays
using BenchmarkTools

a, b = CUDA.rand(1, 1000), CUDA.rand(1000, 1);

bench1(a, b) = CUDA.@sync sum(@~ a .+ b; dims=1)
@btime bench1(a, b)

bench2(a, b) = CUDA.@sync sum(a .+ b; dims=1)
@btime bench2(a, b)

Getting 125 μs and 156 μs respectively on my GTX 1080 Ti, so using LazyArrays.jl is faster and uses less memory!

However, reductions over a BroadcastArray (created via the LazyArray constructor) revert to scalar indexing on the GPU:

CUDA.allowscalar(false)
sum(LazyArray(@~ a .+ b); dims=1)

ERROR: Scalar indexing is disallowed.
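
Materializing the broadcast first avoids the problem, but of course allocates the full temporary that the lazy wrapper is meant to avoid (a workaround sketch, not what I'm after):

c = Base.Broadcast.materialize(@~ a .+ b)   # eager: equivalent to a .+ b, one fused kernel
sum(c; dims=1)                              # plain CUDA reduction, no scalar indexing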

Also:

  • displaying a LazyArray of broadcasted CuArrays doesn't work
  • I tried using Adapt.jl on BroadcastArray, ApplyArray, and Base.Broadcast.Broadcasted, but it didn't seem to help with the above problems (roughly the kind of rule sketched below).
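
A sketch of the kind of Adapt rule I had in mind (untested, and the f/args field names are my guess at BroadcastArray's internals):

using Adapt, LazyArrays

# Hypothetical: adapt the wrapped arguments (e.g. move them to/from the GPU)
# while keeping the lazy BroadcastArray wrapper itself.
Adapt.adapt_structure(to, A::BroadcastArray) =
    BroadcastArray(A.f, map(x -> Adapt.adapt(to, x), A.args)...)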

I think this would be a really powerful feature, one that would make writing fast, efficient, non-trivial GPU functions much easier.

jenkspt avatar Apr 10 '22 18:04 jenkspt

I don’t have easy access to an Nvidia chip but would love to accept a PR.

I don’t quite see what you are proposing for sum. Do you want sum(a .+ b) to lower to sum(a) + sum(b)? That’s pretty easy to add.

dlfivefifty avatar Apr 15 '22 06:04 dlfivefifty

Sorry, that's a bad example. More generally, I'm interested in reductions and accumulations over broadcasted functions, e.g. reduce(*, a .+ b) or accumulate(*, a .+ b). I see this as mapreduce but with the ability to broadcast the 'mapped' function. Interestingly, I'm seeing that sum(@~ a .+ b; dims=1) already works with CuArrays but not with Array (I don't think this is intentional, since Broadcasted isn't an AbstractArray, so maybe this is a peculiarity of CUDA.jl's sum):

using CUDA
a, b = CUDA.rand(1, 10), CUDA.rand(8, 1);
sum(@~ a .+ b; dims=1)    # works

a, b = Array(a), Array(b)
sum(@~ a .+ b; dims=1)    # doesn't work

and the opposite is true for LazyArray

a, b = CUDA.rand(1, 10), CUDA.rand(8, 1);
sum(LazyArray(@~ a .+ b; dims=1))    # uses scalar indexing

a, b = Array(a), Array(b)
sum(LazyArray(@~ a .+ b; dims=1))    # works

With all that said, I think my specific ask is to get LazyArray with reduce and accumulate to play nicely with CuArray. I'm happy to take a stab at this; any ideas on where to focus my attention would be greatly appreciated.
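
Concretely, the behaviour I'm hoping for is roughly the following (a wish list, not what currently happens):

using CUDA, LazyArrays
CUDA.allowscalar(false)

a, b = CUDA.rand(1, 10), CUDA.rand(8, 1)
L = LazyArray(@~ a .+ b)

# Each of these should run as a fused GPU kernel, with no scalar indexing:
reduce(*, L)
accumulate(+, L; dims=1)
sum(L; dims=1)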

jenkspt avatar Apr 21 '22 22:04 jenkspt

I think you meant

sum(LazyArray(@~ a .+ b); dims=1)

I'm a bit confused about what you want to happen. We have

reduce(*, a .+ b) == (a[1] + b[1]) * … * (a[end] + b[end])

But what do you want this to do when a and b are on the GPU? If I'm not mistaken, calling getindex there is very slow...

dlfivefifty avatar Apr 23 '22 13:04 dlfivefifty

The LazyArray defers evaluation to the point of access, in a reduce (e.g. sum) or collect operation. On the GPU, reductions are not performed sequentially but in a tree-like scheme, or (from a programming point of view) even completely in parallel for all pointwise (i.e. broadcasting) operations, to minimize computation (possibly with some copy to the CPU for the last 100 elements or so). The trouble seems to be that LazyArrays' broadcasting somehow has the sequential operation (iterating through the elements) baked into it when the broadcast is executed.

In my application I currently don't even need the reductions to work, but I would still like to use LazyArrays for the broadcast, e.g. to save memory (and memory-transfer time) when calculating outer products. But this also triggers individual scalar index access, which has to be avoided when operating on a GPU:

Warning: Performing scalar indexing on task Task (runnable) @0x000000000a2107d0.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore C:\Users\pi96doc\.julia\packages\GPUArraysCore\ZBmfM\src\GPUArraysCore.jl:90
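
For concreteness, the outer-product use I have in mind looks roughly like this (a sketch; right now consuming the lazy result triggers the warning above):

using CUDA, LazyArrays
CUDA.allowscalar(false)

u, v = CUDA.rand(1000), CUDA.rand(1000)
vt = v'                          # lazy adjoint, still GPU-backed
outer = LazyArray(@~ u .* vt)    # lazy outer product: ideally no 1000×1000 buffer
sum(outer)                       # should reduce on the GPU without materializing;
                                 # currently this falls back to scalar indexing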

Can the execution of the broadcast somehow be deferred to the standard broadcasting mechanism? If so, I would expect the package to also work with CUDA.

RainerHeintzmann avatar Aug 19 '22 05:08 RainerHeintzmann

... Digging into the code, it seems that the issue with the loop is a minor display issue. However, when calling collect on an object created by LazyArray(@~ ...), the dest object in the materialize call (line 23 in lazybroadcasting.jl) has the wrong type: it is a plain vanilla Matrix rather than a CuArray. So maybe the allocation of the destination goes wrong here? But if I do a broadcast assignment into a destination, I get an error about an assignment using a non-bits-type argument, which is a bit strange since I cannot find LazyArrays anywhere in the call stack, yet the same assignment works fine with plain CuArray objects.
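
To illustrate what I would expect the destination allocation to do: plain Base broadcasting picks the destination type from the Broadcasted's style, so with CuArray arguments it allocates a CuArray and runs one fused kernel (a sketch of the mechanism, not LazyArrays' actual code):

using CUDA

a, b = CUDA.rand(1, 10), CUDA.rand(8, 1)
bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(+, a, b))

T    = Base.Broadcast.combine_eltypes(bc.f, bc.args)
dest = similar(bc, T)      # a CuArray, chosen from the broadcast style
copyto!(dest, bc)          # fused GPU kernel, no scalar indexing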

RainerHeintzmann avatar Aug 19 '22 06:08 RainerHeintzmann