Streams are blocked by cudaStreamSynchronize() in sorting functions
Description
Sadly, cudaStreamSynchronize() called at the end of sort and this call blocks the CPU thread until the sort is not finished. Because of this, we can't use multistream solution to parallelize the sorting algorithms.
I wrote a minimal repro based on the sort stream example and attached the trace results.
Sync coming from thrust, maybe somehow synchronization policy can be changed or async_sort should be used: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L482
This issue happens with argsort also, where merge sort is used: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/sort.h#L202C24-L202C44
Python GIL can affect threading, and for multiprocessing, you need the Multi-Process Service (MPS) to use multi-process parallel on a single GPU.
Tested cuPY version is 13.4.1
To Reproduce
# nvprof --print-gpu-trace python examples/stream/thrust.py
import cupy
sort_size = 128 * 1024
x1 = cupy.random.rand(sort_size)
y1 = x1.copy()
x2 = cupy.random.rand(sort_size)
y2 = x2.copy()
cupy.cuda.Device().synchronize()
stream1 = cupy.cuda.stream.Stream(non_blocking=True)
with stream1:
y1 = x1.sort()
stream2 = cupy.cuda.stream.Stream(non_blocking=True)
with stream2:
y2 = x2.sort()
cupy.cuda.Device().synchronize()
Installation
Wheel (pip install cupy-***)
Environment
# Paste the output here
Additional Information
No response
I saw issue https://github.com/cupy/cupy/issues/7759, the multi-streams sort should and can work without multiple threads.
Thanks for reporting, @neon60! I guess we should replace par with par_nosync in CuPy codebase. This looks like a new feature added in Thrust 1.16.0 which we overlooked.
par_nosync may or may not work. It is a hint not an enforcing policy. Note that there are several open issues that we still haven't addressed on the Thrust side (link).
@neon60 would you be able to help us test if switching to par_nosync actually fixes it for CuPy?
Thanks for the fast response. Yes, I can try to test par_nosync soon. Will share the results.
@neon60 How did it go?
Hi,
I'm also interested in this. Just tested, and can confirm that using par_nosync yields the expected result in the case of @neon60 's example. That is: the sort kernels are executed in an interleaved manner:
I'd be happy to open a PR with this fix applied universally in cupy_thrust.cu, if that's something I can help with.
PR is always welcomed!