Create a CUDA context
Thanks to IRTools.jl, we can do some nifty things with Julia IR, like using a dynamo to recursively walk through the IR of a call and offload sensible ops to the GPU.
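For readers unfamiliar with dynamos, the core pattern looks roughly like this (a minimal sketch following the IRTools docs, not this PR's actual implementation — a real pass would rewrite array calls to CUDA equivalents instead of leaving the IR unchanged):

```julia
using IRTools
using IRTools: IR, @dynamo, recurse!

# A dynamo receives the types of a call and returns (possibly transformed) IR.
@dynamo function roundtrip(a...)
  ir = IR(a...)
  ir === nothing && return   # no Julia IR available (e.g. intrinsics): run normally
  recurse!(ir)               # rewrite nested calls so they also go through roundtrip
  return ir
end
```

Calling `roundtrip(f, args...)` then executes `f(args...)` while walking its whole call graph, which is where a GPU-offloading pass gets its hooks in.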
```julia
julia> c = Conv((3,3), 3 => 16, relu, pad = (1,1)); # from Flux

julia> r = rand(Float32, 32, 32, 3, 100);

julia> cuda() do
           c(r)
       end # run on GPU
```

```julia
julia> a = rand(Float32, 5*10^4);

julia> b = rand(Float32, 5*10^4);

julia> cuda() do
           a + b
       end
50000-element Array{Float32,1}:
 0.9649581
 1.2122422
 0.423553
 ⋮
```
Notice that the return type is a normal Array, meaning that without much fiddling it is trivial to offload computation to the GPU and pick up where you left off.
There are a couple of caveats: not all functions behave nicely yet, and we need better test coverage. But I'm opening this now to get some review and a sense of direction for the way forward.
cc @MikeInnes
ref https://github.com/JuliaGPU/CuArrays.jl/issues/303
Thanks! What is driving the choice to use IRTools over Cassette? I would prefer the maintenance burden to rest with Cassette (i.e. me).
The choice was made for the slightly nicer control over the IR that IRTools offers. It's also conceptually simpler, so maintaining it should be easier. It was also fairly straightforward to define in less code, making it more readable. Mind you, I'm no Cassette pro, but it's definitely worth a discussion.
@vchuravy there probably isn't much in it, so if the lead maintainers of this package strongly prefer Cassette then I imagine it'd be OK to port it over.
Though as Dhairya points out, there are a couple of potential advantages to fine-grained control of the IR pass; the main one is that it's easier to cut out classes of functions we're not interested in, e.g. intrinsics or certain modules in Base, avoiding some redundant recompilation.
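Concretely, cutting out those classes of functions could look something like the following sketch (`IGNORED_MODULES` and `should_recurse` are illustrative names, not this PR's API):

```julia
# Decide whether the IR pass should recurse into a function, skipping
# builtins/intrinsics and functions owned by modules we never transform.
const IGNORED_MODULES = (Base, Core, Core.Intrinsics)

function should_recurse(f)
  f isa Core.Builtin && return false            # getfield, arrayref, ...
  m = parentmodule(typeof(f))                   # module that owns the function
  return !(m in IGNORED_MODULES)
end
```

The dynamo would consult such a predicate before calling `recurse!` on a nested call, leaving everything else to run natively and avoiding redundant recompilation.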
Very interesting! Looking forward to giving this a spin, might open up some nice new ways of doing GPU computation.
I guess we'll need some way to assert GPU execution to actually test this?
Yeah, for the tests I was thinking of having a context we can inspect, to assert that the array actually lives in it and corresponds to memory associated with the GPU.
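Such a testing context might be sketched like this (`TestContext`, `mark!`, and `ondevice` are hypothetical names for illustration, not CuArrays API):

```julia
# A context that records which arrays were "moved to the device", so tests
# can assert that offloading really happened without needing a GPU.
struct TestContext
  device_ids::Set{UInt}   # objectids of arrays the context placed on the device
end
TestContext() = TestContext(Set{UInt}())

# Record an array as device-resident and return it unchanged.
mark!(ctx::TestContext, x) = (push!(ctx.device_ids, objectid(x)); x)

# Did this context ever move `x` to the device?
ondevice(ctx::TestContext, x) = objectid(x) in ctx.device_ids
```

In a real test, `cuda() do ... end` would run with such a context installed, and the test would assert `ondevice` on the buffers it expects to have been offloaded.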
Grrml Gmail ate my reply:
Since CUDAnative will use Cassette and GPUifyLoops already does, I would strongly prefer having only one tool in the GPU ecosystem for this. I would be interested in making the IRTools transforms/utility functions work with Cassette, which should be relatively straightforward.