GPUArrays.jl
copyto! does not support CPU SubArrays
Describe the bug
`copyto!` does not work for SubArrays when scalar `getindex` is disallowed.
To Reproduce
A minimal working example (MWE) for this bug:
```julia
using CuArrays
CuArrays.allowscalar(false)
N = 100
u_c = rand(N, 2);
u_d = CuArrays.CuArray(ones(N));
u_v = view(u_c, :, 1);
copyto!(u_d, u_v)  # works
copyto!(u_v, u_d)  # errors: falls back to scalar indexing, which is disallowed
```
Environment details
Details on Julia:
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Status `~/.julia/environments/v1.1/Project.toml`
[3a865a2d] CuArrays v1.2.1
I can see why avoiding scalar indexing for copyto! is non-trivial with views in general, but it would be great to have copyto! avoid scalar indexing at least for contiguous views. One of SubArray's type parameters (L) indicates whether the SubArray supports "fast linear indexing". I don't know whether that is exactly the same as being strictly contiguous in memory (i.e. in a DMA sense), but it would be nice if copyto! could somehow be a bit more discerning about when to fall back to scalar indexing for views.
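For concreteness, the two properties can be checked from plain Julia (a sketch using only Base; `fastlinear` is a hypothetical helper that just reads off the `L` type parameter):

```julia
# `L` (fast linear indexing) is the 5th type parameter of SubArray;
# Base.iscontiguous checks the stricter memory-contiguity property.
fastlinear(v::SubArray{T,N,P,I,L}) where {T,N,P,I,L} = L

A = rand(10, 4)

v1 = view(A, :, 2)       # whole column: fast-linear AND memory contiguous
v2 = view(A, 1:2:9, 2)   # strided column: fast-linear but NOT contiguous
v3 = view(A, 1:2, 1:3)   # 2-D block: neither fast-linear nor contiguous

fastlinear(v1), Base.iscontiguous(v1)
fastlinear(v2), Base.iscontiguous(v2)
fastlinear(v3), Base.iscontiguous(v3)
```

So `L == true` is not the same thing as contiguity: a strided view is fast-linear without being contiguous, while `Base.iscontiguous` appears to be the stricter check a DMA-aware copyto! would want.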
The problem is that we don't want to generalize our copy routines to support all kinds of CPU arrays, because doing so introduces a whole bunch of ambiguities (we tried that in the past: https://github.com/JuliaGPU/GPUArrays.jl/pull/284). At the same time, we don't want to duplicate the copy routines from GPUArrays (https://github.com/JuliaGPU/GPUArrays.jl/blob/2b7dbebcae54f2d315f4f482f74cd412a3c326c5/src/host/abstractarray.jl#L43-L214) in every back-end just so each can have a method for both Array and a specifically-typed contiguous CPU-based SubArray.
I completely agree about not wanting to support all kinds of CPU arrays, but IMHO memory-contiguous views are an important enough subset of CPU arrays to be worthy of support.
The use case where I first encountered this limitation was one where I wanted B (with B >= 2) MxN matrices on the CPU, so that while one is being transferred to the GPU the other(s) can continue to be populated on the CPU side. I first implemented this as a single MxNxB array on the CPU with an MxN view of each slice along the B dimension, but I soon discovered that copyto! doesn't work with views (i.e. SubArrays). The workaround for this case is to make a vector (or tuple) of B separate MxN Arrays instead. This results in more calls to Mem.pin and more-but-smaller DMA mappings, which seems less efficient, but I haven't had any problems with it in practice.
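The MxNxB layout described above can be sketched as follows (dimensions are illustrative); notably, each slice-along-the-last-dimension view is memory contiguous, so it is exactly the kind of SubArray a DMA-aware copyto! could accept:

```julia
# One MxNxB backing array whose B slabs are viewed as MxN matrices.
# Views that fix only the trailing dimension are memory contiguous.
M, N, B = 4, 8, 2
buf = zeros(Float32, M, N, B)
slabs = [view(buf, :, :, b) for b in 1:B]

all(Base.iscontiguous, slabs)  # each (Slice, Slice, Int) view is contiguous
```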
A more problematic use case is where the GPU produces MxN parts of a larger MxBN output array. I would like to be able to use copyto! to copy each MxN part directly into its place in the larger MxBN output array in CPU memory, but AFAIK this is not currently possible. The workaround is to have B different MxN output arrays, but treating those as a larger MxBN array is cumbersome at best.
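Since Julia arrays are column major, each MxN part occupies M*N consecutive elements of the MxBN output, so the destination can at least be expressed today as a contiguous 1-D view over the flattened array (a sketch; `part` is a hypothetical helper, and the flattened view is used because `Base.iscontiguous` recognizes a unit-range view of a Vector as contiguous):

```julia
# each MxN part of the MxBN output occupies M*N consecutive elements in
# column-major order, so a unit-range view of the flattened array selects it
M, N, B = 4, 8, 3
out = zeros(Float32, M, B * N)
part(b) = view(vec(out), (b - 1) * M * N + 1 : b * M * N)

Base.iscontiguous(part(2))  # a unit-range view of a Vector is contiguous
copyto!(part(2), fill(1f0, M * N))  # writes directly into columns N+1:2N of out
```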
It seems like there must be some way to have a copyto! method that takes a SubArray{T,N,Array,I,true} and a CuArray and performs a DMA copy if the SubArray is memory contiguous, or falls back to scalar indexing if the SubArray is not memory contiguous. Maybe SubArray{T,N,Array,I,true} is already guaranteed to be memory contiguous?
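For what it's worth, L = true alone does not guarantee contiguity (a strided view such as view(A, 1:2:9, 2) is fast-linear but not contiguous), but `Base.iscontiguous` can drive exactly this kind of dispatch. A CPU-only sketch of the proposed method, where `copy_into_view!` is a hypothetical name and the raw-pointer copy stands in for the device-to-host DMA copy:

```julia
# CPU-only sketch of the proposed dispatch: bulk pointer copy when the
# destination view is memory contiguous, element-wise fallback otherwise.
function copy_into_view!(dest::SubArray{T}, src::Array{T}) where {T}
    length(dest) == length(src) || throw(DimensionMismatch("length mismatch"))
    if Base.iscontiguous(dest)
        # a contiguous view exposes a valid pointer into its parent array,
        # so one bulk copy suffices (this is where the DMA copy would go)
        GC.@preserve dest src unsafe_copyto!(pointer(dest), pointer(src), length(src))
    else
        # fall back to element-wise (scalar-indexing) copying
        @inbounds for i in 1:length(src)
            dest[i] = src[i]
        end
    end
    return dest
end

A = zeros(4, 3)
copy_into_view!(view(A, :, 2), collect(1.0:4.0))  # contiguous: pointer path
copy_into_view!(view(A, 1:2:3, 1), [9.0, 9.0])    # strided: fallback path
```

A real GPU method would replace the `unsafe_copyto!` call with the back-end's device copy; the dispatch structure is the point of the sketch.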
Hello,
do you have any news on this issue? I think it would be very useful for many applications.
Nobody seems to be working on this, so no, there have been no updates.