[FEA]: Add a copy routine to support data copy between two `mdspan`s
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
libcu++
Is your feature request related to a problem? Please describe.
CUDA Python & nvmath-python need to have a copy routine added to CCCL for copying from one mdspan to another. The requirements for this copy routine include:
- This routine would copy the contents of an ndarray (i.e., an N-D tensor) A with a certain data type, shape, and strides to an ndarray B with the same dtype and shape but not necessarily the same strides.
- Since the underlying ndarrays are strided, they are not necessarily contiguous in memory and do not necessarily share the same memory layout, so a dedicated copy kernel is needed
- This routine can handle mdspans covering either host or device tensors, so that H2D/D2H copies can be abstracted out by the same API
- In the case of D2H copies, synchronous copies are fine
- This routine should be JIT-compilable (by NVRTC) to serve Python users better
This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).
We believe that if src and dst do not overlap, and if both reside on the device, there might be existing implementations in cuTENSOR (e.g., cutensorPermute) on which we could base a prototype. We can focus on functionality first (right now the copy kernel used in nvmath-python comes from CuPy) and improve performance in future iterations.
Describe the solution you'd like
I'm not sure what the best solution is, so just a thought: perhaps offer an overload of cuda::std::copy that is specialized for mdspan?
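For illustration, a minimal sketch of what such an overload could look like — nothing like this exists in libcu++ today, and the exact signature, constraints, and placement are assumptions:

```cpp
#include <cuda/std/mdspan>

// Hypothetical sketch only -- no such overload exists in libcu++ today.
// The idea: require identical element type and extents, but allow the
// layouts (and hence the strides) of src and dst to differ.
namespace cuda { namespace std {
template <class T, class Extents, class SrcLayout, class DstLayout>
void copy(mdspan<const T, Extents, SrcLayout> src,
          mdspan<T, Extents, DstLayout> dst); // would launch a strided-copy kernel
}} // namespace cuda::std
```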
Describe alternatives you've considered
No response
Additional context
Once this routine is offered, a Python abstraction can be built in CUDA Python or elsewhere.
(Tentatively assigned to Federico as per our offline discussion 🙂)
cc: @kmaehashi for vis
> This routine should be JIT-compilable (by NVRTC)

Can you elaborate on how you envision this would work? This is necessarily a host API, and NVRTC can't compile host code.
> This is necessarily a host API and NVRTC can't compile host code.

We have a C library now, don't we? 🙂
@jrhemstad @gevtushenko Correct me if I am wrong, since I am not fluent enough in mdspan: given that shape, strides, and dtype are all run-time properties in Python, if this were a host API we would have to pre-instantiate a large number of copy kernel instances, and even then we could not cover all possibilities. Therefore, I feel NVRTC compatibility (which is a requirement of the C library anyway) is necessary.
Another reason for NVRTC compatibility: to unblock nvmath-python, I think we should focus on D2D copies (between two potentially different memory layouts) for now, and let nvmath-python handle the remaining H2D/D2H parts, which should be easy (just use cudaMemcpyAsync with a staging buffer) and is already what CuPy does for us today. And I presume a D2D copy can be achieved by a single kernel compiled by NVRTC.
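For concreteness, here is a minimal sketch of the kind of single, NVRTC-compilable strided-copy kernel this refers to. This is not CCCL code; the name, the element-stride convention, and passing shape/strides through device-visible arrays are all assumptions (a JIT pipeline could instead bake them in as compile-time constants):

```cpp
// Hypothetical strided D2D copy kernel. Rank and dtype are fixed at JIT
// time via template parameters; shape and strides stay run-time values.
template <typename T, int Rank>
__global__ void strided_copy(T* dst, const T* src,
                             const long long* shape,
                             const long long* dst_strides,  // in elements
                             const long long* src_strides)  // in elements
{
  long long size = 1;
  for (int d = 0; d < Rank; ++d) {
    size *= shape[d];
  }
  // Grid-stride loop over the logical (linearized) element index.
  for (long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
       idx < size;
       idx += (long long)gridDim.x * blockDim.x) {
    // Unravel idx into N-D coordinates (row-major) and accumulate the
    // possibly different linear offsets in src and dst.
    long long rem = idx, src_off = 0, dst_off = 0;
    for (int d = Rank - 1; d >= 0; --d) {
      long long coord = rem % shape[d];
      rem /= shape[d];
      src_off += coord * src_strides[d];
      dst_off += coord * dst_strides[d];
    }
    dst[dst_off] = src[src_off];
  }
}
```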
> We have a C library now, don't we?
So what you really mean is "Provide a solution that doesn't require pre-instantiating a lot of kernels and may internally use NVRTC to JIT compile specific kernel instantiations".
By "NVRTC compatible" I understood you wanted it so someone could take cuda::copy(mdspan, mdspan) and compile it directly with NVRTC on their own. This wouldn't be feasible anymore than it is for someone to try and compile cub::DeviceReduce with NVRTC on their own.
I believe you are right. We should think of this new copy routine as if it were a CUB device-wide algorithm.
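To make that concrete, usage would presumably follow CUB's two-phase temp-storage idiom. A sketch assuming a hypothetical mdspan-taking cub::DeviceCopy::Copy (which does not exist today — see also the later comments in this thread):

```cpp
#include <cub/cub.cuh>

// Hypothetical wrapper showing the CUB device-wide calling convention
// (as in cub::DeviceReduce); the mdspan overload of cub::DeviceCopy::Copy
// is an assumption, not a real API.
template <class SrcMdspan, class DstMdspan>
cudaError_t copy_mdspan(SrcMdspan src, DstMdspan dst, cudaStream_t stream = 0)
{
  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;
  // First call: query the temporary storage requirement.
  cub::DeviceCopy::Copy(d_temp_storage, temp_storage_bytes, src, dst, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Second call: perform the strided copy.
  cub::DeviceCopy::Copy(d_temp_storage, temp_storage_bytes, src, dst, stream);
  return cudaFree(d_temp_storage);
}
```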
What I originally had in mind was really just a kernel; I wanted to do the pre-/post-processing as well as the kernel compilation/launch myself, but I had forgotten that this does not fit the compute paradigm anywhere in CCCL. Thanks for the clarifying questions.
FYI, Apple MLX counterpart: https://github.com/ml-explore/mlx/pull/1421
For what it's worth, this was implemented here in RAFT (actual implementation here). It could be adapted for use outside of RAFT by switching from RAFT's resources object to just using an ordinary CUDA stream and a cuBLAS handle.
Looking more closely, I remember now that we used mdarray for some paths of the implementation, so we would need to resolve #2474 in order to adapt the code directly. We also make use of the vocabulary types mentioned in #2476, but that is much easier to work around.
Moving this from MVP scope to next steps
@fbusato is this still on your radar, or should we find someone else to take over?
This issue was probably assigned to me by mistake, or I forgot about it entirely. In any case, it's a task that could be interesting to me. We can also rely on the host/device mdspan support that we added recently. I cannot promise anything, but I would say July could be a good time frame.
Sounds good, thanks Federico! Let's keep this on your plate then 🙂
A quick update on this: the nvmath-python team needs to be unblocked ASAP, so they've been looking into adding a Python exposure of this copy routine to cuda.core, backed by a custom C++ kernel. They will probably have something more concrete to discuss in the next few weeks; I'll keep you posted. Eventually, though, we want to move the C++ bits to CCCL, so perhaps by July we can review the status and decide how to productize their prototype.
One solution we didn't think of is to use cub::DeviceFor::ForEachInExtents directly:
cub::DeviceFor::ForEachInExtents(Extent{}, [=] __device__ (int idx, int i, int j, int k) {
    // idx is the linearized index; (i, j, k) are the unraveled 3D coordinates
    d_mdspan_out(i, j, k) = d_mdspan_in(i, j, k);
});
full example here: https://godbolt.org/z/89GY4oYP1
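For convenience, a self-contained sketch along the same lines (the shapes, layouts, and Thrust setup here are illustrative, not taken from the linked example):

```cpp
#include <cub/device/device_for.cuh>
#include <cuda/std/mdspan>
#include <thrust/device_vector.h>

int main()
{
  using Extents = cuda::std::extents<int, 2, 3, 4>; // static 2x3x4 shape
  constexpr int size = 2 * 3 * 4;

  thrust::device_vector<float> in(size, 1.0f);
  thrust::device_vector<float> out(size, 0.0f);

  // Same shape, different layouts: the source is C-contiguous
  // (layout_right), the destination is Fortran-ordered (layout_left),
  // so the same logical element sits at different linear offsets.
  cuda::std::mdspan<const float, Extents, cuda::std::layout_right>
    d_mdspan_in(thrust::raw_pointer_cast(in.data()));
  cuda::std::mdspan<float, Extents, cuda::std::layout_left>
    d_mdspan_out(thrust::raw_pointer_cast(out.data()));

  // idx is the linearized index; (i, j, k) are the unraveled coordinates.
  cub::DeviceFor::ForEachInExtents(Extents{},
    [=] __device__ (int idx, int i, int j, int k) {
      d_mdspan_out(i, j, k) = d_mdspan_in(i, j, k);
    });
  return cudaDeviceSynchronize();
}
```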
TL;DR
We need:
- an mdspan-based copy algorithm implementation on the `cub` end, along with `cuda.cccl.parallel` exposure for it
- modern CUDA runtime interfaces in both C++ and Python for it
Context
When @fbusato started working on cub::DeviceFor::ForEachInExtents, it was meant as a generic implementation of cub::DeviceCopy::Copy that'd accept mdspan and that we'd further optimize later on. The motivation is similar to that of cub::DeviceTransform: we can't see inside a user's operator to optimize, say, cub::DeviceFor, but when we know the memory access pattern, we can leverage ublkcp, ldgsts, vectorization, etc. We might also want to expose a knob for cub::DeviceCopy::Copy that'd select between using the copy engine and an advanced kernel. We'll need that abstraction when we start working on cuda::mdarray anyway.
We also discussed exposing an interface to mdspan-based copy in the modern runtime with @pciolkosz. I think it makes sense to have it there, since 1D sequence memory copies are too low-level a facility. If it makes sense to have an mdspan-based interface in the modern CUDA C++ runtime that relies on cub::DeviceCopy::Copy as an implementation detail, it should make sense to have a cuda.core interface that relies on cuda.parallel as well.
I don't think cuda.core can depend on cuda.parallel; currently it is the other way around, and it would be infeasible to change this.