cupy Streams are blocked by cudaStreamSynchronize() in sorting functions

Description

Sadly, cudaStreamSynchronize() called at the end of sort and this call blocks the CPU thread until the sort is not finished. Because of this, we can't use multistream solution to parallelize the sorting algorithms.

I wrote a minimal repro based on the sort stream example and attached the trace results.

Sync coming from thrust, maybe somehow synchronization policy can be changed or async_sort should be used: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L482

This issue happens with argsort also, where merge sort is used: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L202C24-L202C44

Python GIL can affect threading, and for multiprocessing, you need the Multi-Process Service (MPS) to use multi-process parallel on a single GPU.

Tested cuPY version is 13.4.1

To Reproduce

# nvprof --print-gpu-trace python examples/stream/thrust.py
import cupy

sort_size = 128 * 1024

x1 = cupy.random.rand(sort_size)
y1 = x1.copy()

x2 = cupy.random.rand(sort_size)
y2 = x2.copy()

cupy.cuda.Device().synchronize()

stream1 = cupy.cuda.stream.Stream(non_blocking=True)
with stream1:
    y1 = x1.sort()

stream2 = cupy.cuda.stream.Stream(non_blocking=True)
with stream2:
    y2 = x2.sort()

cupy.cuda.Device().synchronize()

Installation

Wheel (pip install cupy-***)

Environment

# Paste the output here

Additional Information

No response

May 08 '25 07:05 neon60

I saw issue https://github.com/cupy/cupy/issues/7759, the multi-streams sort should and can work without multiple threads.

May 08 '25 08:05 neon60

Thanks for reporting, @neon60! I guess we should replace par with par_nosync in CuPy codebase. This looks like a new feature added in Thrust 1.16.0 which we overlooked.

May 09 '25 01:05 kmaehashi

par_nosync may or may not work. It is a hint not an enforcing policy. Note that there are several open issues that we still haven't addressed on the Thrust side (link).

@neon60 would you be able to help us test if switching to par_nosync actually fixes it for CuPy?

May 09 '25 03:05 leofang

Thanks for the fast response. Yes, I can try to test par_nosync soon. Will share the results.

May 09 '25 08:05 neon60

@neon60 How did it go?

May 19 '25 21:05 leofang

Hi, I'm also interested in this. Just tested, and can confirm that using par_nosync yields the expected result in the case of @neon60 's example. That is: the sort kernels are executed in an interleaved manner:

I'd be happy to open a PR with this fix applied universally in cupy_thrust.cu, if that's something I can help with.

Nov 19 '25 16:11 mfep

PR is always welcomed!

Nov 19 '25 16:11 leofang