[FEA]: Add a copy routine to support data copy between two `mdspan`s
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
libcu++
Is your feature request related to a problem? Please describe.
CUDA Python & nvmath-python need to have a copy routine added to CCCL for copying from one mdspan to another. The requirements for this copy routine include:
- This routine would copy the contents of an ndarray (i.e., an N-D tensor) A with a certain data type, shape, and strides to an ndarray B with the same dtype and shape but not necessarily the same strides.
- Since the underlying ndarrays are strided, they are not necessarily contiguous in memory and do not necessarily share the same memory layout, so a dedicated copy kernel is needed
- This routine can handle mdspans covering either host or device tensors, so that H2D/D2H copies can be abstracted out by the same API
- In the case of D2H copies, synchronous copies are fine
- This routine should be JIT-compilable (by NVRTC) to serve Python users better
This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).
We believe that if src and dst do not overlap, and if both reside on the device, there might be existing implementations in cuTENSOR (e.g., cutensorPermute) on which we could base a prototype. We can focus on functionality first (right now the copy kernel used in nvmath-python comes from CuPy) and improve performance in future iterations.
Describe the solution you'd like
I'm not sure what the best solution is, so just a thought: perhaps offer an overload of cuda::std::copy that is specialized for mdspan?
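For illustration, a minimal sketch of what such an overload could look like — nothing like this exists in libcu++ today, and the exact signature, constraints, and placement are assumptions:

```cpp
#include <cuda/std/mdspan>

// Hypothetical sketch only -- no such overload exists in libcu++ today.
// The idea: require identical element type and extents, but allow the
// layouts (and hence the strides) of src and dst to differ.
namespace cuda { namespace std {
template <class T, class Extents, class SrcLayout, class DstLayout>
void copy(mdspan<const T, Extents, SrcLayout> src,
          mdspan<T, Extents, DstLayout> dst); // would launch a strided-copy kernel
}} // namespace cuda::std
```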
Describe alternatives you've considered
No response
Additional context
Once this routine is offered, a Python abstraction can be built in CUDA Python or elsewhere.
(Tentatively assigned to Federico as per our offline discussion 🙂)
cc: @kmaehashi for vis
> This routine should be JIT-compilable (by NVRTC)

Can you elaborate on how you envision this would work? This is necessarily a host API, and NVRTC can't compile host code.
> This is necessarily a host API and NVRTC can't compile host code.

We have a C library now, don't we? 🙂
@jrhemstad @gevtushenko Correct me if I am wrong, since I am not fluent enough in mdspan: given that shape, strides, and dtype are all run-time properties in Python, if this were a host API we would have to pre-instantiate a large number of copy kernel instances, and even then we could not cover all possibilities. Therefore, I feel NVRTC compatibility (which is a requirement of the C library anyway) is necessary.
Another reason for NVRTC compatibility: to unblock nvmath-python, I think we should focus on D2D copies (between two potentially different memory layouts) for now, and let nvmath-python handle the remaining H2D/D2H parts, which should be easy (just use cudaMemcpyAsync with a staging buffer) and is already what CuPy does for us today. And I presume a D2D copy can be achieved by a single kernel compiled by NVRTC.
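For concreteness, here is a minimal sketch of the kind of single, NVRTC-compilable strided-copy kernel this refers to. This is not CCCL code; the name, the element-stride convention, and passing shape/strides through device-visible arrays are all assumptions (a JIT pipeline could instead bake them in as compile-time constants):

```cpp
// Hypothetical strided D2D copy kernel. Rank and dtype are fixed at JIT
// time via template parameters; shape and strides stay run-time values.
template <typename T, int Rank>
__global__ void strided_copy(T* dst, const T* src,
                             const long long* shape,
                             const long long* dst_strides,  // in elements
                             const long long* src_strides)  // in elements
{
  long long size = 1;
  for (int d = 0; d < Rank; ++d) {
    size *= shape[d];
  }
  // Grid-stride loop over the logical (linearized) element index.
  for (long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
       idx < size;
       idx += (long long)gridDim.x * blockDim.x) {
    // Unravel idx into N-D coordinates (row-major) and accumulate the
    // possibly different linear offsets in src and dst.
    long long rem = idx, src_off = 0, dst_off = 0;
    for (int d = Rank - 1; d >= 0; --d) {
      long long coord = rem % shape[d];
      rem /= shape[d];
      src_off += coord * src_strides[d];
      dst_off += coord * dst_strides[d];
    }
    dst[dst_off] = src[src_off];
  }
}
```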
> We have a C library now, don't we?
So what you really mean is "Provide a solution that doesn't require pre-instantiating a lot of kernels and may internally use NVRTC to JIT compile specific kernel instantiations".
By "NVRTC compatible" I understood you wanted it so someone could take cuda::copy(mdspan, mdspan) and compile it directly with NVRTC on their own. This wouldn't be feasible anymore than it is for someone to try and compile cub::DeviceReduce with NVRTC on their own.
I believe you are right. We should think of this new copy routine as if it were a CUB device-wide algorithm.
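To make that concrete, usage would presumably follow CUB's two-phase temp-storage idiom. A sketch assuming a hypothetical mdspan-taking cub::DeviceCopy::Copy (which does not exist today — see also the later comments in this thread):

```cpp
#include <cub/cub.cuh>

// Hypothetical wrapper showing the CUB device-wide calling convention
// (as in cub::DeviceReduce); the mdspan overload of cub::DeviceCopy::Copy
// is an assumption, not a real API.
template <class SrcMdspan, class DstMdspan>
cudaError_t copy_mdspan(SrcMdspan src, DstMdspan dst, cudaStream_t stream = 0)
{
  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;
  // First call: query the temporary storage requirement.
  cub::DeviceCopy::Copy(d_temp_storage, temp_storage_bytes, src, dst, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Second call: perform the strided copy.
  cub::DeviceCopy::Copy(d_temp_storage, temp_storage_bytes, src, dst, stream);
  return cudaFree(d_temp_storage);
}
```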
What I originally had in mind was really just a kernel; I wanted to do the pre-/post-processing as well as the kernel compilation/launch myself, but I had forgotten that this does not fit the compute paradigm anywhere in CCCL. Thanks for the clarifying questions.
FYI, Apple MLX counterpart: https://github.com/ml-explore/mlx/pull/1421
For what it's worth, this was implemented here in RAFT (actual implementation here). It could be adapted for use outside of RAFT by switching from RAFT's resources object to just using an ordinary CUDA stream and a cuBLAS handle.
Looking more closely, I remember now that we used mdarray for some paths of the implementation, so we would need to resolve #2474 in order to adapt the code directly. We also make use of the vocabulary types mentioned in #2476, but that is much easier to work around.
Moving this from MVP scope to next steps
@fbusato is this still on your radar, or should we find someone else to take over?
This issue was probably assigned to me by mistake, or I forgot about it entirely. In any case, it's a task that could be interesting to me. We can also rely on the host/device mdspan support that we added recently. I cannot promise anything, but I would say July could be a good time frame.
Sounds good, thanks Federico! Let's keep this on your plate then 🙂
A quick update on this: the nvmath-python team needs to be unblocked ASAP, so they've been looking into adding a Python exposure of this copy routine to cuda.core, backed by a custom C++ kernel. They will probably have something more concrete to discuss in the next few weeks; I'll keep you posted. Eventually, though, we want to move the C++ bits to CCCL, so perhaps by July we can review the status and decide how to productize their prototype.
One solution we didn't think of is to use cub::DeviceFor::ForEachInExtents directly:
cub::DeviceFor::ForEachInExtents(Extent{}, [=] __device__ (int idx, int i, int j, int k) {
    // idx is the linearized index; (i, j, k) are the unraveled 3D coordinates
    d_mdspan_out(i, j, k) = d_mdspan_in(i, j, k);
});
full example here: https://godbolt.org/z/89GY4oYP1
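For convenience, a self-contained sketch along the same lines (the shapes, layouts, and Thrust setup here are illustrative, not taken from the linked example):

```cpp
#include <cub/device/device_for.cuh>
#include <cuda/std/mdspan>
#include <thrust/device_vector.h>

int main()
{
  using Extents = cuda::std::extents<int, 2, 3, 4>; // static 2x3x4 shape
  constexpr int size = 2 * 3 * 4;

  thrust::device_vector<float> in(size, 1.0f);
  thrust::device_vector<float> out(size, 0.0f);

  // Same shape, different layouts: the source is C-contiguous
  // (layout_right), the destination is Fortran-ordered (layout_left),
  // so the same logical element sits at different linear offsets.
  cuda::std::mdspan<const float, Extents, cuda::std::layout_right>
    d_mdspan_in(thrust::raw_pointer_cast(in.data()));
  cuda::std::mdspan<float, Extents, cuda::std::layout_left>
    d_mdspan_out(thrust::raw_pointer_cast(out.data()));

  // idx is the linearized index; (i, j, k) are the unraveled coordinates.
  cub::DeviceFor::ForEachInExtents(Extents{},
    [=] __device__ (int idx, int i, int j, int k) {
      d_mdspan_out(i, j, k) = d_mdspan_in(i, j, k);
    });
  return cudaDeviceSynchronize();
}
```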
TL;DR
We need:
- an mdspan-based copy algorithm implementation on the `cub` end, along with `cuda.cccl.parallel` exposure for it
- modern CUDA runtime interfaces in both C++ and Python for it
Context
When @fbusato started working on cub::DeviceFor::ForEachInExtents, it was meant as a generic implementation of cub::DeviceCopy::Copy that'd accept mdspan and that we'd further optimize later on. The motivation is similar to that of cub::DeviceTransform: we can't see inside a user's operator to optimize, say, cub::DeviceFor, but when we know the memory access pattern, we can leverage ublkcp, ldgsts, vectorization, etc. We might also want to expose a knob for cub::DeviceCopy::Copy that'd select between using the copy engine and an advanced kernel. We'll need that abstraction when we start working on cuda::mdarray anyway.
We also discussed exposing an interface to mdspan-based copy in the modern runtime with @pciolkosz. I think it makes sense to have it there, since 1D sequence memory copies are too low-level a facility. If it makes sense to have an mdspan-based interface in the modern CUDA C++ runtime that relies on cub::DeviceCopy::Copy as an implementation detail, it should make sense to have a cuda.core interface that relies on cuda.parallel as well.
I don't think cuda.core can depend on cuda.parallel; currently it is the other way around, and it would be infeasible to change this.