Support for stream ordered memory allocators.
Running into a bottleneck where I am trying to launch many Thrust scan instances concurrently over multiple streams and CPU threads. The problem is that, I believe, Thrust uses CUB's underlying scan implementation, which needs temporary storage to perform the operation, and allocating and deallocating this temporary storage causes the entire device to synchronize on a cudaFree() call.
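For reference, a minimal sketch of the launch pattern that hits this (CPU threads omitted for brevity; sizes and stream count are arbitrary):

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
  const int num_streams = 4;
  std::vector<cudaStream_t> streams(num_streams);
  std::vector<thrust::device_vector<int>> data;
  for (int i = 0; i < num_streams; ++i)
  {
    cudaStreamCreate(&streams[i]);
    data.emplace_back(1 << 20, 1);
  }

  // Each scan is submitted to its own stream, but the internal
  // cudaMalloc()/cudaFree() of the temporary storage still forces a
  // device-wide synchronization, serializing the streams.
  for (int i = 0; i < num_streams; ++i)
    thrust::inclusive_scan(thrust::cuda::par.on(streams[i]),
                           data[i].begin(), data[i].end(),
                           data[i].begin());

  for (auto& s : streams)
  {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
  return 0;
}
```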
CUDA 11.2 adds support for cudaFreeAsync() (stream-ordered memory allocators), so if a stream is supplied to the Thrust execution policy, shouldn't the default behavior be for the malloc and free calls for the temporary storage to also happen on that same stream?
I can provide a minimal working example if needed.
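For illustration, here is roughly the stream-ordered behavior being asked for, sketched around a raw CUB scan call (requires CUDA 11.2+; the wrapper function is mine, not an existing Thrust or CUB API):

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical helper: temporary storage is allocated and freed in
// stream order, so no device-wide synchronization is needed.
void inclusive_sum_on_stream(const int* d_in, int* d_out, int n,
                             cudaStream_t stream)
{
  // First call only queries the required temporary storage size.
  size_t temp_bytes = 0;
  cub::DeviceScan::InclusiveSum(nullptr, temp_bytes, d_in, d_out, n, stream);

  void* d_temp = nullptr;
  cudaMallocAsync(&d_temp, temp_bytes, stream); // ordered on `stream`
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);
  cudaFreeAsync(d_temp, stream);                // also ordered on `stream`
}
```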
This is on our todo list. We're looking into expanding support for these new CUDA features, but it may be a while before we get to them.
For now, we do have a thread-safe caching allocator that can be used to recycle the temporary storage allocations, thrust::single_device_tls_caching_allocator. This would help mitigate the issue. Unfortunately, I can't find any examples or tests that use it with a Thrust algorithm to point you towards.
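In the meantime, here is the general shape of that workaround, modeled on Thrust's examples/cuda/custom_temporary_allocation.cu. The multimap cache below is illustrative only (and, unlike the allocator named above, not thread-safe):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/cuda/memory.h>
#include <cstddef>
#include <map>

struct cached_allocator
{
  typedef char value_type;

  // Cached blocks, keyed by size so equally sized requests are recycled.
  std::multimap<std::ptrdiff_t, char*> free_blocks;

  char* allocate(std::ptrdiff_t num_bytes)
  {
    auto it = free_blocks.find(num_bytes);
    if (it != free_blocks.end()) // cache hit: reuse a previous block
    {
      char* ptr = it->second;
      free_blocks.erase(it);
      return ptr;
    }
    // Cache miss: fall back to a real device allocation.
    return thrust::cuda::malloc<char>(num_bytes).get();
  }

  void deallocate(char* ptr, size_t num_bytes)
  {
    // Return the block to the cache instead of calling cudaFree().
    free_blocks.insert({static_cast<std::ptrdiff_t>(num_bytes), ptr});
  }

  ~cached_allocator()
  {
    for (auto& block : free_blocks)
      thrust::cuda::free(thrust::cuda::pointer<char>(block.second));
  }
};

int main()
{
  cached_allocator alloc;
  thrust::device_vector<int> d(1 << 20, 1);

  // The allocator passed to par() supplies the algorithm's temporary
  // storage; only the first iteration pays for a real cudaMalloc().
  for (int i = 0; i < 10; ++i)
    thrust::inclusive_scan(thrust::cuda::par(alloc),
                           d.begin(), d.end(), d.begin());
  return 0;
}
```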
@brycelelbach and @griwes are most familiar with this part of Thrust, do either of you know where we can find a simple example that uses this feature to run a Thrust algorithm?
https://github.com/NVIDIA/thrust/issues/1391 is related to my comment above.
Sounds great, an example would be wonderful! Thank you!
Related: https://github.com/NVIDIA/libcudacxx/pull/105
@allisonvacanti @jrhemstad, has there been any update on this with the thrust::async() APIs?
I'm experimenting with async::sort right now (see #1718). When profiling, it seems the cudaFree is only done after synchronization (i.e. .wait() on the final thrust::device_event). So it should work for you (the cudaMalloc of later async algorithms is overlapped with computation from earlier ones). I would be surprised if it were different for async::reduce. If I understand the implementation correctly, the memory is owned by a unique_ptr which is then moved into the thrust::device_event that is returned to the caller.
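For reference, a minimal sketch of that pattern:

```cpp
#include <thrust/async/sort.h>
#include <thrust/device_vector.h>

int main()
{
  thrust::device_vector<float> a(1 << 20), b(1 << 20);

  // Each call returns a thrust::device_event that owns the sort's
  // temporary storage, so no blocking cudaFree happens at launch time.
  auto ea = thrust::async::sort(a.begin(), a.end());
  auto eb = thrust::async::sort(b.begin(), b.end());

  // The cudaFree of each temporary buffer is only observed around these
  // synchronization points, once the owning event is waited on.
  ea.wait();
  eb.wait();
  return 0;
}
```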