
[WIP] Support enabling PTDS via CUDA_PTDS environment variable

Open pentschev opened this pull request 4 years ago • 33 comments

Following up on a conversation with @harrism @jakirkham @rongou yesterday, I made a small change to the Cython bindings so that we can enable PTDS via a CUDA_PTDS runtime environment variable. This differs from https://github.com/rapidsai/rmm/pull/480, where rebuilding the Python package is necessary. This allows us to test RMM together with other Python libraries in a non-intrusive fashion, requiring users to explicitly enable PTDS.

It's important to note that this only works for the RMM API (for example, rmm.DeviceBuffer); using Numba to do the copying will still result in using the default stream only. To add Numba support, we may need to make a similar change there. Tagging @gmarkall for awareness too.

I currently used stream 2 because I couldn't find an existing definition in RMM. I'm happy to replace that with a proper definition, but I'm not sure where it should live yet.
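
To make the mechanism concrete, the check amounts to something like the following (a minimal Python sketch; the helper name is hypothetical, and the handle values mirror CUDA's cudaStreamPerThread and the usual default stream):

import os

def _default_stream_value():
    # Opt-in only: the per-thread default stream is used solely when
    # the user sets CUDA_PTDS to a non-empty, non-zero value.
    if os.environ.get("CUDA_PTDS", "0") not in ("", "0"):
        return 2  # CUDA's cudaStreamPerThread handle
    return 0      # the usual default, which still means the legacy stream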

pentschev avatar Nov 20 '20 11:11 pentschev

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

GPUtester avatar Nov 20 '20 11:11 GPUtester

First, are you sure this is the right place to do this?

When you say "this is the right place", are you referring to RMM itself, to the Cython binding or to the CudaStreamView?

Second, is this too heavy-handed?

Why is this heavy-handed if we only allow the user to explicitly enable it?

Third, I think the env variable should not start with CUDA. It should be very explicit, clear and dangerous sounding: RMM_FORCE_PER_THREAD_DEFAULT_STREAM.

I'm fine with a different variable name; I chose a very general-purpose one only to see how people feel about it. Should we have a very specific name like the one you suggested, or a variable name that all projects could share, not only RMM but also Numba, CuPy, etc.?

pentschev avatar Nov 20 '20 21:11 pentschev

When you say "this is the right place", are you referring to RMM itself, to the Cython binding or to the CudaStreamView?

I mean the constructor of CudaStreamView. I suppose as long as there is still a way to explicitly request cudaStreamLegacy in this mode, it's OK.

I'm fine with a different variable name; I chose a very general-purpose one only to see how people feel about it. Should we have a very specific name like the one you suggested, or a variable name that all projects could share, not only RMM but also Numba, CuPy, etc.?

There were two things I didn't like about the name. One, it starts with CUDA, which should probably be reserved for official CUDA variables. Two, it has an acronym that many people won't know (CUDA docs never use the term "PTDS" as far as I know).

harrism avatar Nov 22 '20 21:11 harrism

I mean the constructor of CudaStreamView. I suppose as long as there is still a way to explicitly request cudaStreamLegacy in this mode, it's OK.

Sorry, this is still not 100% clear to me. Are you saying that you want a way to explicitly request cudaStreamLegacy even when the user has explicitly set the environment variable? If so, maybe we could change the default to stream=None, which would choose between cudaStreamLegacy and cudaStreamPerThread depending on the environment variable; otherwise, the user can pass a stream explicitly, which would completely bypass the environment variable mechanism.
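
Concretely, something along these lines (a rough Cython sketch; the attribute name is illustrative, and the variable name CUDA_PTDS is the one currently in the PR while the naming is still under discussion):

from libc.stdint cimport uintptr_t
import os

cdef class CudaStreamView:
    cdef uintptr_t _stream

    def __cinit__(self, stream=None):
        if stream is None:
            # No explicit stream: the environment variable decides.
            if os.environ.get("CUDA_PTDS", "0") not in ("", "0"):
                stream = 2  # cudaStreamPerThread
            else:
                stream = 0  # the (legacy) default stream
        # An explicitly passed stream (including cudaStreamLegacy, 1)
        # bypasses the environment variable mechanism entirely.
        self._stream = stream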

Calling it CUDA_* implies that it's an official CUDA feature, which it is not. ... There were two things I didn't like about the name. One, it starts with CUDA, which should probably be reserved for official CUDA variables. Two, it has an acronym that many people won't know (CUDA docs never use the term "PTDS" as far as I know).

I understand all that; my intent here is to ask for additional input on whether we want an RMM-specific variable (and consequently specific variables for other libraries, e.g., CUPY_CUDA_PTDS) or something more general, perhaps PYTHON_CUDA_PTDS. We could still expand that to RMM_PER_THREAD_DEFAULT_STREAM and do the same for other libraries. IMHO, I would prefer the short version and document it well; it gets very annoying to type all that, and people who will use it will do so because they know what they're doing.

pentschev avatar Nov 23 '20 10:11 pentschev

I understand all that; my intent here is to ask for additional input on whether we want an RMM-specific variable (and consequently specific variables for other libraries, e.g., CUPY_CUDA_PTDS) or something more general, perhaps PYTHON_CUDA_PTDS. We could still expand that to RMM_PER_THREAD_DEFAULT_STREAM and do the same for other libraries. IMHO, I would prefer the short version and document it well; it gets very annoying to type all that, and people who will use it will do so because they know what they're doing.

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.
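
For instance, assuming such a variable were added, setting NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM=1 would surface in Python roughly as follows (hypothetical at this point; guarded with getattr since the config entry doesn't exist yet):

from numba import config

# Mapped automatically from NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM.
if getattr(config, "CUDA_PER_THREAD_DEFAULT_STREAM", 0):
    # ... use cudaStreamPerThread as the default stream ...
    pass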

gmarkall avatar Nov 23 '20 10:11 gmarkall

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

This is what I thought too; although CuPy doesn't have a configuration module, it follows a similar pattern for environment variables. My suggestion of PYTHON_CUDA_PTDS is only meant to provide a simple way to enable things, but I'm OK if that's undesirable; the alternative will probably be something like RMM_PER_THREAD_DEFAULT_STREAM=1 NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM=1 CUPY_CUDA_PER_THREAD_DEFAULT_STREAM=1, which is very convoluted and discouraging.

pentschev avatar Nov 23 '20 11:11 pentschev

Are you saying that you want a way to explicitly request cudaStreamLegacy even when the user has explicitly set the environment variable?

I mean that it already exists: pass 1 to the constructor of CudaStreamView. I don't like the magic numbers, though: I worked very hard in the C++ code to eliminate them, but I couldn't figure out how to do the same in Cython, so help is appreciated. In C++ we have rmm::cuda_stream_per_thread, rmm::cuda_stream_default, and rmm::cuda_stream_legacy. See https://github.com/rapidsai/rmm/blob/e7f07268373652f9f36f68b284af8ca0637c6e08/include/rmm/cuda_stream_view.hpp#L112-L125

harrism avatar Nov 23 '20 11:11 harrism

it gets very annoying to type all that

I generally type env variables into scripts, not at the command line, so the length doesn't matter much. PTDS is not a well-known acronym, and it's easily confused with PTSD.

harrism avatar Nov 23 '20 11:11 harrism

I mean that it already exists: pass 1 to the constructor of CudaStreamView. I don't like the magic numbers, though: I worked very hard in the C++ code to eliminate them, but I couldn't figure out how to do the same in Cython, so help is appreciated. In C++ we have rmm::cuda_stream_per_thread, rmm::cuda_stream_default, and rmm::cuda_stream_legacy. See

I spent some time trying to do that, and indeed there seems to be no simple, clean way. In https://github.com/rapidsai/rmm/pull/633/commits/d7891178b1c6d3bf52b28862e76efab2e71a19ea I exposed the definitions of cudaStreamLegacy/cudaStreamPerThread from the C includes, but apparently we can't use custom types in def __cinit__, so we still need to pass something like uintptr_t stream=0 to the constructor, as is currently done.

The solution I see is to define custom Python types, perhaps just based on uintptr_t, and rely on those, writing our own PyCudaStreamLegacy (or similar) that we can pass to CudaStreamView. Perhaps @jakirkham and @gmarkall, who have more experience than I do, have a better suggestion for handling this issue.
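
For reference, exposing the C definitions amounts to something like this (a sketch; the exact header and declarations may differ from the commit above):

cdef extern from "cuda_runtime_api.h" nogil:
    ctypedef void* cudaStream_t
    # cudaStreamLegacy/cudaStreamPerThread are macros in the CUDA
    # headers; declaring them like variables lets Cython reference them.
    cudaStream_t cudaStreamLegacy
    cudaStream_t cudaStreamPerThread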

pentschev avatar Nov 23 '20 15:11 pentschev

I generally type env variables into scripts, not at the command line, so the length doesn't matter much.

I do type variables; explicit is much better than implicit for me. Throughout my life I have wasted countless hours on implicit definitions from configuration files, system defaults, etc., so I hate things that are easy to forget, especially as time passes.

PTDS is not a well-known acronym, and it's easily confused with PTSD.

I agree it isn't a well-known acronym, but no acronym or name is well-known from day zero. And if the concern is confusion with similar acronyms from completely different areas of knowledge, we might as well tell everyone to never use acronyms again. :)

pentschev avatar Nov 23 '20 15:11 pentschev

Perhaps @jakirkham and @gmarkall, who have more experience than I do, have a better suggestion for handling this issue.

I only have basic working experience with Cython, so I'm not sure what a better suggestion would look like :-(.

gmarkall avatar Nov 23 '20 16:11 gmarkall

I spent some time trying to do that, and indeed there seems to be no simple, clean way. In d789117 I exposed the definitions of cudaStreamLegacy/cudaStreamPerThread from the C includes, but apparently we can't use custom types in def __cinit__, so we still need to pass something like uintptr_t stream=0 to the constructor, as is currently done.

The solution I see is to define custom Python types, perhaps just based on uintptr_t, and rely on those, writing our own PyCudaStreamLegacy (or similar) that we can pass to CudaStreamView. Perhaps @jakirkham and @gmarkall, who have more experience than I do, have a better suggestion for handling this issue.

Just a drive-by comment... It is not possible to set the default value of a Python def function (such as __cinit__ here) to a value defined in C. A workaround could be to define a cpdef enum that contains these values. A cpdef enum has a Python representation according to PEP 435, which can then be used as the default value:

cpdef enum cudaStream:
    cudaStreamLegacy = 1     # CUDA's cudaStreamLegacy handle value
    cudaStreamPerThread = 2  # CUDA's cudaStreamPerThread handle value

cdef f(cudaStream s=cudaStreamLegacy):
    pass

cdef class XYZ:
    def __cinit__(self, cudaStream s=cudaStreamLegacy):
        pass
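
With that in place, both the default and an explicit per-thread request work from Python, e.g. (using the XYZ placeholder class above):

XYZ()                                # defaults to cudaStreamLegacy
XYZ(cudaStream.cudaStreamPerThread)  # explicit per-thread default stream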

leofang avatar Nov 23 '20 17:11 leofang

Thanks @leofang , that's indeed very informative. I'm feeling less confused now, that was pretty much the only solution I could arrive at and your confirmation is quite helpful!

pentschev avatar Nov 23 '20 18:11 pentschev

I changed CudaStreamView following the suggestions from https://github.com/rapidsai/rmm/pull/633#issuecomment-732323696; let me know if that's an acceptable solution.

pentschev avatar Nov 23 '20 19:11 pentschev

I'm firmly against PTDS in the name of the environment variable. It's inconsistent with the rest of RMM and with CUDA.

harrism avatar Nov 23 '20 22:11 harrism

I'm firmly against PTDS in the name of the environment variable. It's inconsistent with the rest of RMM and with CUDA.

Is this https://github.com/rapidsai/rmm/pull/633/commits/1ee52afcc5ae9ef3c02d4ddc117f3a8a688042b4 ok for you?

pentschev avatar Nov 23 '20 22:11 pentschev

Please retarget to branch-0.18 when ready. Adding to 0.18 project for tracking.

harrism avatar Dec 02 '20 20:12 harrism

One thing to consider here is that this environment variable currently only affects the Python side. If an app uses RMM from both Python and C++, it may see inconsistent results...

harrism avatar Dec 02 '20 20:12 harrism

One thing to consider here is that this environment variable currently only affects the Python side. If an app uses RMM from both Python and C++, it may see inconsistent results...

That's true. Are you suggesting we should consider supporting that in C++ as well, or are you just stating the inconsistency?

pentschev avatar Dec 02 '20 22:12 pentschev

One thing to consider here is that this environment variable currently only affects the Python side. If an app uses RMM from both Python and C++, it may see inconsistent results...

That's true. Are you suggesting we should consider supporting that in C++ as well, or are you just stating the inconsistency?

Right now I'm stating the inconsistency. I haven't formed an opinion yet.

harrism avatar Dec 03 '20 01:12 harrism

I updated this PR to work with the latest branch-0.18. Would appreciate some new feedback here.

pentschev avatar Jan 06 '21 17:01 pentschev

Trying to control the per-thread default stream with environment variables does not feel like a good idea to me. It would be one thing if there were a CUDA-provided environment variable that controlled this behavior, but doing it ad hoc per library just feels like a bad idea.

Furthermore, the existing environment variable deviates from the standard meaning of "per-thread default stream". Normally, enabling PTDS changes the meaning of the value 0 from the legacy default stream to a per-thread stream. What's being done here is using cuda_stream_per_thread (the value 2) by default, i.e., the meaning of 0 has not changed. So if RMM_PER_THREAD_DEFAULT_STREAM were enabled and stream 0 was used anywhere, it would still mean the legacy default stream.
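
To illustrate with the API touched by this PR (assuming DeviceBuffer's constructor accepts a stream argument, as in this PR, and with RMM_PER_THREAD_DEFAULT_STREAM enabled):

import rmm

rmm.DeviceBuffer(size=1024)            # uses cudaStreamPerThread (2)
rmm.DeviceBuffer(size=1024, stream=0)  # 0 still means the legacy
                                       # default stream, unlike
                                       # compile-time PTDS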

jrhemstad avatar Jan 08 '21 21:01 jrhemstad

Trying to control the per-thread default stream with environment variables does not feel like a good idea to me. It would be one thing if there were a CUDA-provided environment variable that controlled this behavior, but doing it ad hoc per library just feels like a bad idea.

I mostly agree with that. However, we need to control the PTDS behavior per library anyway, i.e., we'll need to add support to all of them somehow, and we have to start somewhere. Furthermore, using environment variables has been the most reasonable approach I've seen for enabling a behavior globally in Python without requiring changes to user code or to third-party dependencies of those libraries, which is valuable at this initial stage because it doesn't break existing code.

Furthermore, the existing environment variable deviates from the standard meaning of "per-thread default stream". Normally, enabling PTDS changes the meaning of the value 0 from the legacy default stream to a per-thread stream. What's being done here is using cuda_stream_per_thread (the value 2) by default, i.e., the meaning of 0 has not changed. So if RMM_PER_THREAD_DEFAULT_STREAM were enabled and stream 0 was used anywhere, it would still mean the legacy default stream.

This is also true, and by no means is this intended as a flawless or future-proof approach, but it allows our team and other advanced users to begin testing and profiling with PTDS, particularly with Dask. For now, my assumption is that the libraries we'll use with PTDS will rely on RMM's default, so using 0 isn't expected.

With the above said, I'm open to suggestions for improving this in a way that lets us at least begin some testing, without requiring each new user to modify and install RMM from source. In the future, we may prefer to address each piece of code individually and handle PTDS at a local level, but that would involve touching pretty much every Python library used by RAPIDS today, which would be inefficient and even more intrusive at this stage.

pentschev avatar Jan 08 '21 22:01 pentschev

we need to control the PTDS behavior per library anyway

I don't think this is true. So long as every library exposes a stream argument to indicate on which stream an operation should occur, one need only specify cuda_stream_per_thread at the top level of an operation, and it will then be passed down through every other invoked library:

def cudf_thing(stream):
    # Pass the provided stream down to the libraries we invoke.
    do_cupy_thing(stream)
    do_numba_thing(stream)
jrhemstad avatar Jan 08 '21 22:01 jrhemstad

I don't think this is true. So long as every library exposes a stream argument to indicate on which stream an operation should occur, one need only specify cuda_stream_per_thread at the top level of an operation, and it will then be passed down through every other invoked library.

Yes, this is what I meant, sorry for being unclear. That's true, and probably the right move long-term, but right now it's still a bit distant, as we really need to make sure we can pass the stream to all of those libraries.

pentschev avatar Jan 08 '21 22:01 pentschev

I think this should be re-targeted to 0.19

quasiben avatar Mar 02 '21 01:03 quasiben

Have updated the upstream branch targeted by the PR. Looks like that updated GitHub Projects automatically. Not sure if there's anything else needed in terms of retargeting

jakirkham avatar Mar 02 '21 01:03 jakirkham

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Apr 03 '21 05:04 github-actions[bot]

I haven't been able to continue working on PTDS, but hopefully can come back to it in the next month or so.

pentschev avatar Apr 06 '21 08:04 pentschev

For reference, I implemented something similar in Numba (see https://github.com/numba/numba/pull/6936) and went with NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM, for the reasons outlined previously:

In Numba we'd probably want to have an env var called NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM because environment variables starting with NUMBA_ are mapped into the config module such that the config variable is numba.config.CUDA_PER_THREAD_DEFAULT_STREAM, and CUDA_ follows NUMBA_ for variables related to the CUDA target. So a more general CUDA_PTDS or PYTHON_CUDA_PTDS could be taken into account, but would likely duplicate / need resolving with the environment variable / config variable configuration in Numba.

gmarkall avatar Apr 15 '21 22:04 gmarkall