rmm RFC: Support CUDA Stream Protocol

cuda.core is an official CUDA Python project: https://nvidia.github.io/cuda-python/cuda-core/latest/index.html. It offers a Pythonic, self-contained, and lightweight interface over the CUDA programming model. We encourage new Python projects to use cuda.core.<experimental>.Stream directly.

For existing Python projects such as PyTorch, transitioning to cuda.core may or may not be immediately feasible. As a result, we encourage projects that already expose a CUDA stream to Python to follow the CUDA Stream protocol: https://nvidia.github.io/cuda-python/cuda-core/latest/interoperability.html#cuda-stream-protocol and add a __cuda_stream__ method to the stream class, so as to improve interoperability without introducing extra ExternalStream-like types.
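As a minimal sketch of what following the protocol involves, assuming protocol version 0 as described in the linked interoperability page (a 2-tuple of the version number and the address of the `cudaStream_t`, both Python ints): the class name `MyStream` and its constructor are hypothetical, used only for illustration, and are not part of any library.

```python
class MyStream:
    """Hypothetical stream class that an existing project already exposes."""

    def __init__(self, handle: int):
        # `handle` is assumed to be the address of an existing cudaStream_t,
        # stored as a plain Python int. This sketch does not create or
        # destroy any CUDA resource.
        self._handle = int(handle)

    def __cuda_stream__(self):
        # Protocol version 0: (version, address of the cudaStream_t).
        return (0, self._handle)
```

With this one method in place, a consumer that understands the protocol can accept `MyStream` instances directly, with no separate ExternalStream-like wrapper type.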

Here is a PyTorch example of how it'd be used interoperably with cuda.core: https://github.com/NVIDIA/cuda-python/blob/c4f4ffe83d246eafb6adf1574e5a7c86bbcef944/cuda_core/examples/pytorch_example.py

This issue was previously discussed in #1770 and #1775. Unfortunately, we still end up exposing a stream class as part of the RMM public API. I consider it a mistake that we should rectify. But it can be done and tracked separately from this effort.

cc @kkraus14 @aterrel @rparolin for vis

Sep 19 '25 19:09 leofang

We can add the protocol relatively easily, but I'd like to address this additional comment:

> Unfortunately, we still end up exposing a stream class as part of the RMM public API. I consider it a mistake that we should rectify. But it can be done and tracked separately from this effort.

Is the cuda.core Stream object now stable, and does it come with Cython bindings (i.e. pxd files)?

Many of the questions @bdice asked in https://github.com/rapidsai/rmm/pull/1775#issuecomment-2571687998 are still relevant for adoption here.

From a glance at the implementation it seems like if I wanted to create cuda.core.Streams from owning C++ objects I would need to implement my own wrapper anyway, because I have to manage the lifetime when importing from a foreign handle. And then I'm not sure I win anything over just implementing the protocol on my owning object.

I don't think it's easy to migrate wholesale to cuda.core.Stream, because cuda.core is very opinionated about stream creation and device management in a way that would require changes everywhere. AFAICT, I have to interoperate with other languages through the CUDA runtime types. That is fine, since they are an appropriate lingua franca, but now both sides of my interface have to polyfill this gap. So again, why adopt it rather than keep using my own thing, which already works?
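The lifetime concern above can be sketched in pure Python. This is an illustration under stated assumptions, not how cuda.core or RMM is implemented: `OwningStream` stands in for an object backed by an owning C++ stream, and `BorrowedStream` stands in for the wrapper one would have to write around a non-owning foreign handle. Both names are hypothetical.

```python
class OwningStream:
    """Hypothetical stand-in for an object that owns a native stream."""

    def __init__(self, handle: int):
        self._handle = handle

    def __cuda_stream__(self):
        # Protocol version 0: (version, address of the cudaStream_t).
        return (0, self._handle)


class BorrowedStream:
    """Hypothetical non-owning view over a foreign stream handle."""

    def __init__(self, owner):
        # The view does not own the native stream, so it holds a strong
        # reference to the owner to keep the handle valid while in use.
        self._owner = owner
        _, self._handle = owner.__cuda_stream__()

    def __cuda_stream__(self):
        return (0, self._handle)
```

The wrapper ends up re-implementing the same protocol method as the owning object, which is the redundancy the comment points at.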

Oct 09 '25 15:10 wence-

@wence-

> From a glance at the implementation it seems like if I wanted to create cuda.core.Streams from owning C++ objects I would need to implement my own wrapper anyway, because I have to manage the lifetime when importing from a foreign handle. And then I'm not sure I win anything over just implementing the protocol on my owning object.

Would you feel differently on this point if cuda.core.Stream were just a cuda::stream C++ object under the hood, which would make it easier to manage lifetime between C++ and Python?

Oct 09 '25 17:10 jrhemstad

@wence-

> > From a glance at the implementation it seems like if I wanted to create cuda.core.Streams from owning C++ objects I would need to implement my own wrapper anyway, because I have to manage the lifetime when importing from a foreign handle. And then I'm not sure I win anything over just implementing the protocol on my owning object.
>
> Would you feel differently on this point if cuda.core.Stream were just a cuda::stream C++ object under the hood, which would make it easier to manage lifetime between C++ and Python?

I think yes (we'd need to migrate both sides of the boundary, sure, but we wouldn't need the polyfill).

Oct 09 '25 17:10 wence-

@wence- -- thank you, I've been meaning to voice this exact concern and didn't know how to start the discussion.

I think we need close parity between C++ (CCCL) and Python (cuda.core), especially via Cython objects, for us to bridge this gap.

Oct 09 '25 19:10 bdice

#2094 implemented the __cuda_stream__ protocol, so rmm.pylibrmm.stream.Stream can be used anywhere that supports the protocol.

There are some larger discussions started here, so I haven't closed this issue yet.

Oct 28 '25 12:10 TomAugspurger