
RFC: Enable per-thread default stream in free-threading builds

Open leofang opened this issue 1 year ago • 9 comments

tl;dr: For the Python 3.13 free-threading build (cp313t), the per-thread default stream is enabled and used by default. Users need to set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=0 to explicitly opt out and restore the old behavior.

In CUDA, there are two kinds of default streams:

  • Legacy default stream (synchronizes with all blocking streams)
    • Unless action is taken as described in the CUDA Programming Guide, this is the default. Most of the time the null/0 stream is a synonym for the legacy default stream.
  • Per-thread default stream (synchronizes only with the legacy default stream)

Today, CUDA Python offers a way to switch between the legacy and per-thread default streams (at the time of loading the driver symbols) via the environment variable CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM, and the default has been as if it were set to 0 (i.e., using the legacy default stream).
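
For illustration, a minimal sketch of opting in today (assuming the env var must be set before the driver symbols are loaded, i.e. before cuda.bindings is first used; the import shown is just one way that can happen):

import os

# Opt in to the per-thread default stream. This must happen before cuda.bindings
# loads the driver symbols, since the variable is read only once at that point.
os.environ["CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM"] = "1"

from cuda.bindings import driver  # subsequent driver calls resolve to the PTDS symbols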

However, the legacy default stream is a very common pitfall: performance-seeking applications and users find themselves needing to create nonblocking streams explicitly just to avoid implicit synchronization. This change would remove the need to create such nonblocking streams. It would also allow GPU workloads launched from different host threads -- when no explicit stream is used -- to overlap and execute in parallel, instead of being serialized on the same (legacy default) stream.

The free-threading build offers a natural opportunity and perfect timing for us to change the default to behave as if the env var were set to 1, i.e. using the per-thread default stream. This also gives NVIDIA a path forward to assess the feasibility of deprecating (and eventually removing!) the legacy default stream, which has been a long-standing quest of ours.

Users of the regular build will not be affected; only those testing the experimental cp313t free-threading build will.

leofang avatar Sep 26 '24 14:09 leofang

From the perspective of numba-cuda I think this is a positive change and a good chance to make the change to the default. Numba does already support PTDS with the user explicitly setting it: https://numba.readthedocs.io/en/stable/reference/envvars.html#envvar-NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM

The user can also explicitly force which kind of default stream they want: https://numba.readthedocs.io/en/stable/cuda-reference/host.html#numba.cuda.per_thread_default_stream / https://numba.readthedocs.io/en/stable/cuda-reference/host.html#numba.cuda.legacy_default_stream but I think this is orthogonal to what the default is.

gmarkall avatar Sep 26 '24 15:09 gmarkall

cc @kmaehashi for vis

leofang avatar Sep 26 '24 16:09 leofang

My worry about PTDS is that it is hard to sync across different threads’ default streams when the default value has a different meaning on every thread. Thread 2 cannot get a handle to thread 1’s default stream, if I understand correctly. Do we have a way for threads to give a handle or view on their own default stream to another thread? It seems like creating a pool of nonblocking streams is still necessary if synchronization across threads is needed. There is also a correctness concern if thread 2 wants to sync with work happening on thread 1’s default stream, because thread 2 has no way to represent that stream except for the magic default stream value, which is a different “physical” stream for each thread.

Example:

  • Thread 1 calls a library function that takes a stream parameter, and launches work on the user-provided stream value of legacy stream 0.
  • The user calls another library function on thread 2 that depends on the previous work, and thus the user enqueues it on the same stream as before, legacy stream 0. (For simplicity, assume that we know this call is sequenced-after thread 1’s call.)
  • The user may not know if they are running on a free-threaded build of Python — and thus may not know if their code is using PTDS.
  • If PTDS is disabled, this is safe. Work on thread 2 will happen after work on thread 1 (assuming that we know the thread 2 call is sequenced after the thread 1 call).
  • If PTDS is enabled, this is not safe (the work is not necessarily executed in a stream-ordered way). We cannot be sure of the order of work, because thread 1 and thread 2 interpret the default stream handle 0 as different streams.

If I am understanding correctly, a library developer has no way to enforce stream safety across threads under PTDS if the user-provided stream value is the legacy stream, while disallowing legacy stream use entirely seems like overstepping.

I may be misrepresenting the PTDS model — and would love to improve my understanding if this is not a concern. In practice this could cause bugs if the user passes legacy stream 0 in a PTDS build without knowing the semantics of PTDS — meaning that adopters of free-threaded Python would also need to know how the CUDA programming model differs under this build. This seems like a fundamental-enough distinction in the stream safety model that I would hesitate to enable it unless all of CUDA was planning to make PTDS the default. Today I am aware of very few applications that use PTDS.

bdice avatar Dec 13 '24 03:12 bdice

I had some discussions related to this topic offline. I am updating my position here a bit.

Do we have a way for threads to give a handle or view on their own default stream to another thread?

This is not so much of a concern. If threads wanted to synchronize across their (different) default streams, a shared event could be used to enforce synchronization.
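
For example, a minimal sketch of event-based synchronization across two threads' default streams (assuming PTDS is enabled, and assuming cuda.bindings.runtime exposes the CUDA runtime API with tuple-returning wrappers; the exact spellings and the acceptance of 0 as a stream handle here are assumptions, and error checking is omitted):

import threading
from cuda.bindings import runtime as cudart

err, event = cudart.cudaEventCreate()  # event shared by both threads

def producer():
    # ... enqueue work on this thread's default stream (handle 0) ...
    cudart.cudaEventRecord(event, 0)  # capture the tail of thread 1's default stream

def consumer():
    cudart.cudaStreamWaitEvent(0, event, 0)  # thread 2's default stream waits on the event
    # ... enqueue dependent work on thread 2's default stream ...

t1 = threading.Thread(target=producer)
t1.start()
t1.join()  # ensure the record is enqueued before the wait is issued
t2 = threading.Thread(target=consumer)
t2.start()
t2.join()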

The main concern remains that "A PTDS build can lead to race conditions in code that is otherwise safe that users were already relying on" (from @vyasr), which can be created by a situation like what I proposed in the example above. If the user's code was implicitly relying on legacy stream 0 being stream-ordered across threads, using PTDS could create race conditions. In the past, we have considered shipping libcudf builds with PTDS always enabled, but decided it was too unsafe / prone to race conditions. See discussion in https://github.com/rapidsai/cudf/pull/11281.

At present, free-threaded Python users are likely to be early adopters, who might be more open to experimental/unstable behaviors. However, if free-threaded Python becomes the default in the future, I don't want to worry about whether enabling PTDS inadvertently added race conditions to existing code. Moreover, I think the usability of PTDS is going to be limited unless we build every library with PTDS enabled. RAPIDS hopes to ship exactly one C++ wheel per library (for reasons like package size on PyPI and CI matrix growth), and that package is independent of the Python minor version, so shipping an additional PTDS-enabled package for free-threaded Python users is not something RAPIDS would be excited about adopting.

bdice avatar Dec 20 '24 21:12 bdice

FWIW here's my proposal for cudf.

vyasr avatar Dec 20 '24 22:12 vyasr

FWIW here's my proposal for cudf.

Unfortunately, it is not that simple to dynamically detect and switch which default stream is in use. cuda-bindings was implemented this way (controlled by an env var that is read only once, when initializing the function pointer cache) for a reason: in the CUDA driver, many functions have two sets of internal symbols, one for legacy and another for PTDS. Therefore, once the choice is made and the symbols are loaded, they really cannot be changed casually.

leofang avatar Jan 19 '25 05:01 leofang

@leofang could you please give me an example of what you're referring to?

vyasr avatar Jan 21 '25 20:01 vyasr

This requires a bit of grep'ing to see. Try grep'ing in $CUDA_PATH/include for the suffixes _ptds and _ptsz, and the macros __CUDA_API_PTDS and __CUDA_API_PTSZ. You will see that when CUDA_API_PER_THREAD_DEFAULT_STREAM is defined, CUDA uses driver symbols with those two suffixes; otherwise, it uses the symbols without the suffixes.

Because there's no such thing as a compile-time macro in Python, in cuda/bindings/_bindings/cydriver.pyx we use the said env var to control which set of driver symbols we load, and once loaded we can't unload and reload the other set. That would be too much work, and it's also hard to make safe.
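
To illustrate the idea (a rough sketch only, not the actual implementation in cydriver.pyx; the env var parsing and the use of ctypes here are simplifying assumptions):

import ctypes
import os

# The choice is read once, up front, mirroring how the bindings decide which
# driver symbols to load into their function pointer cache.
use_ptds = os.environ.get("CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM", "0") != "0"

libcuda = ctypes.CDLL("libcuda.so.1")
# The driver exports both variants of such functions, e.g. cuMemcpyAsync and
# cuMemcpyAsync_ptsz; only one of them is resolved and cached.
name = "cuMemcpyAsync_ptsz" if use_ptds else "cuMemcpyAsync"
cuMemcpyAsync = getattr(libcuda, name)
# Switching to the other variant later would mean re-resolving every cached
# function pointer, which is why the decision is made only at load time.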

leofang avatar Feb 07 '25 03:02 leofang

Couldn't cuda-python unconditionally load both sets of symbols and then dispatch appropriately at runtime based on which one we wanted?

vyasr avatar Feb 28 '25 05:02 vyasr

It sounds like there are two problems here:

  1. Introducing race conditions by default due to different default stream synchronization behavior between these two APIs
  2. How best to enable this feature (e.g., loading symbols and dispatching later)

I'm not sure that 2 informs 1, and 1 seems like a blocker that we should make a decision on in order to make progress on this issue.

Do we have some code examples that illustrate how one might be forced to change their code if the default stream was changed to per-thread? I believe that will help drive this discussion to a decision point.

cpcloud avatar Sep 24 '25 14:09 cpcloud

I don't have any example code already written, but it's pretty easy to come up with some ideas. Consider the following (very rough) pseudocode:

def make_cuda_allocation(size):
    # cudaMallocAsync / default_stream stand in for the real stream-ordered allocation call
    return cudaMallocAsync(size, default_stream)

def do_work(suballocation):
    # Do some stream-ordered work on suballocation using the default stream
    ...

def main():
    allocation = Thread(make_cuda_allocation, size).get()  # Using another thread for demonstration purposes
    N = size // num_tasks  # chunk size handled by each task
    tasks = [Thread(do_work, allocation[i*N:(i+1)*N]) for i in range(num_tasks)]

This is safe if everything runs on the legacy default stream because make_cuda_allocation is guaranteed to be completed before anything in do_work can access the allocation. With PTDS as the default this is a race, because there's no guarantee that the cudaMallocAsync in make_cuda_allocation will have completed before any of the threads executing do_work try to access the memory.

To be clear (given the context of this issue), this kind of code is also a race even in non-free-threaded builds of Python. The GIL prevents concurrency, but there is no guarantee of order of execution beyond the bytecode level, so you could still have some thread start do_work before the first thread actually triggered the allocation. Almost certainly not in such a contrived example as the one above, but if make_cuda_allocation were executing hundreds of lines of code before the actual cudaMallocAsync call, those are all points where the OS could preempt this thread and transfer control flow to one of the subsequent ones.

vyasr avatar Sep 26 '25 18:09 vyasr

I think to simplify @vyasr's example:

def main():
    allocation = make_cuda_allocation(size)  # Uses the main thread
    N = size // num_tasks  # chunk size handled by each task
    tasks = [Thread(do_work, allocation[i*N:(i+1)*N]) for i in range(num_tasks)]  # Uses "worker" threads

With the legacy default stream the above is safe, since worker threads won't start until the allocation has been enqueued on the default stream. With the per-thread default stream, the allocation is enqueued on one stream and the worker threads run on different streams, which is very much a race condition, regardless of the GIL. Given we're releasing the GIL everywhere in cuda.bindings regardless, I think we're going to find specific situations that fail under free threading but currently don't, outside of some esoteric cases related to singletons / caching within our own libraries.

kkraus14 avatar Sep 26 '25 20:09 kkraus14

I don't think we need to turn this on for FT builds for the 0.4.0 release. We're already carving out best-effort support for the FT builds, and thus we provide no guarantees about anything there except that it's available to try. It wouldn't, therefore, be considered a breaking change if we eventually turn this on once we start formally supporting the FT builds at a level better than best-effort.

cpcloud avatar Oct 03 '25 19:10 cpcloud

I agree with @cpcloud

kkraus14 avatar Oct 03 '25 21:10 kkraus14