Support for stream ordered memory allocators.
Running into a bottleneck where I am trying to launch many Thrust scan instances concurrently over multiple streams and CPU threads. The problem is that, I believe, Thrust uses CUB's underlying scan implementation, which needs temporary storage to perform the operation, and allocating and deallocating this temporary storage causes the entire device to synchronize on a cudaFree() call.
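For reference, a minimal sketch of the launch pattern that hits this (CPU threads omitted for brevity; sizes and stream count are arbitrary):

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
  const int num_streams = 4;
  std::vector<cudaStream_t> streams(num_streams);
  std::vector<thrust::device_vector<int>> data;
  for (int i = 0; i < num_streams; ++i)
  {
    cudaStreamCreate(&streams[i]);
    data.emplace_back(1 << 20, 1);
  }

  // Each scan is submitted to its own stream, but the internal
  // cudaMalloc()/cudaFree() of the temporary storage still forces a
  // device-wide synchronization, serializing the streams.
  for (int i = 0; i < num_streams; ++i)
    thrust::inclusive_scan(thrust::cuda::par.on(streams[i]),
                           data[i].begin(), data[i].end(),
                           data[i].begin());

  for (auto& s : streams)
  {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
  return 0;
}
```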
CUDA 11.2 adds support for cudaFreeAsync() (stream-ordered memory allocators), so if a stream is supplied to the Thrust execution policy, shouldn't the default behavior be for the malloc and free calls for the temporary storage to also happen on that same stream?
I can provide a minimal working example if needed.
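For illustration, here is roughly the stream-ordered behavior being asked for, sketched around a raw CUB scan call (requires CUDA 11.2+; the wrapper function is mine, not an existing Thrust or CUB API):

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical helper: temporary storage is allocated and freed in
// stream order, so no device-wide synchronization is needed.
void inclusive_sum_on_stream(const int* d_in, int* d_out, int n,
                             cudaStream_t stream)
{
  // First call only queries the required temporary storage size.
  size_t temp_bytes = 0;
  cub::DeviceScan::InclusiveSum(nullptr, temp_bytes, d_in, d_out, n, stream);

  void* d_temp = nullptr;
  cudaMallocAsync(&d_temp, temp_bytes, stream); // ordered on `stream`
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);
  cudaFreeAsync(d_temp, stream);                // also ordered on `stream`
}
```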
This is on our todo list. We're looking into expanding support for these new CUDA features, but it may be a while before we get to them.
For now, we do have a thread-safe caching allocator that can be used to recycle the temporary storage allocations, thrust::single_device_tls_caching_allocator. This would help mitigate the issue. Unfortunately, I can't find any examples or tests that use it with a Thrust algorithm to point you towards.
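In the meantime, here is the general shape of that workaround, modeled on Thrust's examples/cuda/custom_temporary_allocation.cu. The multimap cache below is illustrative only (and, unlike the allocator named above, not thread-safe):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/cuda/memory.h>
#include <cstddef>
#include <map>

struct cached_allocator
{
  typedef char value_type;

  // Cached blocks, keyed by size so equally sized requests are recycled.
  std::multimap<std::ptrdiff_t, char*> free_blocks;

  char* allocate(std::ptrdiff_t num_bytes)
  {
    auto it = free_blocks.find(num_bytes);
    if (it != free_blocks.end()) // cache hit: reuse a previous block
    {
      char* ptr = it->second;
      free_blocks.erase(it);
      return ptr;
    }
    // Cache miss: fall back to a real device allocation.
    return thrust::cuda::malloc<char>(num_bytes).get();
  }

  void deallocate(char* ptr, size_t num_bytes)
  {
    // Return the block to the cache instead of calling cudaFree().
    free_blocks.insert({static_cast<std::ptrdiff_t>(num_bytes), ptr});
  }

  ~cached_allocator()
  {
    for (auto& block : free_blocks)
      thrust::cuda::free(thrust::cuda::pointer<char>(block.second));
  }
};

int main()
{
  cached_allocator alloc;
  thrust::device_vector<int> d(1 << 20, 1);

  // The allocator passed to par() supplies the algorithm's temporary
  // storage; only the first iteration pays for a real cudaMalloc().
  for (int i = 0; i < 10; ++i)
    thrust::inclusive_scan(thrust::cuda::par(alloc),
                           d.begin(), d.end(), d.begin());
  return 0;
}
```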
@brycelelbach and @griwes are most familiar with this part of Thrust, do either of you know where we can find a simple example that uses this feature to run a Thrust algorithm?
https://github.com/NVIDIA/thrust/issues/1391 is related to my comment above.
Sounds great, an example would be wonderful! Thank you!
Related: https://github.com/NVIDIA/libcudacxx/pull/105
@allisonvacanti @jrhemstad, has there been any update on this with the thrust::async() APIs?
I'm experimenting with async::sort right now (see #1718). When profiling, it seems the cudaFree is only done after synchronization (i.e. .wait() on the final thrust::device_event). So it should work for you (the cudaMalloc of later async algorithms is overlapped with computation from earlier ones). I would be surprised if it were different for async::reduce. If I understand the implementation correctly, the memory is owned by a unique_ptr which is then moved into the thrust::device_event that is returned to the caller.
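For reference, a minimal sketch of that pattern:

```cpp
#include <thrust/async/sort.h>
#include <thrust/device_vector.h>

int main()
{
  thrust::device_vector<float> a(1 << 20), b(1 << 20);

  // Each call returns a thrust::device_event that owns the sort's
  // temporary storage, so no blocking cudaFree happens at launch time.
  auto ea = thrust::async::sort(a.begin(), a.end());
  auto eb = thrust::async::sort(b.begin(), b.end());

  // The cudaFree of each temporary buffer is only observed around these
  // synchronization points, once the owning event is waited on.
  ea.wait();
  eb.wait();
  return 0;
}
```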