Run all stdgpu operations on a specified cuda stream
I noticed that some functions, such as stdgpu::detail::memcpy, are non-async and run on the DEFAULT CUDA stream. More details: stdgpu::detail::memcpy depends on dispatch_memcpy, which looks like:
```cpp
void dispatch_memcpy(void* destination,
                     const void* source,
                     index64_t bytes,
                     dynamic_memory_type destination_type,
                     dynamic_memory_type source_type)
{
    ...
    // uses the default stream here
    STDGPU_CUDA_SAFE_CALL(cudaMemcpy(destination, source, static_cast<std::size_t>(bytes), kind));
}
```
For example, if we use CUDA graphs and try to capture all operations on a stream, an error is raised because different streams (the default stream and the user's) get mixed:
```
stdgpu : CUDA ERROR :
  Error    : operation would make the legacy stream depend on a capturing blocking stream
  File     : external/stdgpu/src/stdgpu/cuda/impl/memory.cpp:123
  Function : void stdgpu::cuda::dispatch_memcpy(void *, const void *, stdgpu::index64_t, stdgpu::dynamic_memory_type, stdgpu::dynamic_memory_type)
```
So my request: run all stdgpu operations on a specified CUDA stream.
Most of the functionality should support custom CUDA streams by taking a respective execution_policy which wraps the stream, see #351. Part of the memory API is a notable exception, but the memcpy-like functions are not actually used in the containers. Could you provide some pointers to a particular function in stdgpu that triggers this error when called? Does it already happen when you only create a new container, e.g. `auto c = stdgpu::vector<int>::createDeviceObject(1000);`?
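For the algorithms that do take a policy, a custom stream can typically be attached via Thrust's `thrust::cuda::par.on(stream)` execution policy. A minimal sketch, assuming a container `map` with a policy-taking overload (the exact overload set depends on the stdgpu version):

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

// Wrap the stream in a Thrust execution policy and pass it to the
// policy-taking overload, so the work is enqueued on `stream`.
auto policy = thrust::cuda::par.on(stream);
auto range  = map.device_range(policy);  // sketch; `map` is illustrative
```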
@stotko For example, stdgpu::unordered_map<>::device_range. It is non-async:
```
Function : void stdgpu::cuda::dispatch_memcpy(void *, const void *, stdgpu::index64_t, stdgpu::dynamic_memory_type, stdgpu::dynamic_memory_type)
    @ 0x52ff5ac stdgpu::cuda::safe_call()
    @ 0x52ff476 stdgpu::cuda::dispatch_memcpy()
    @ 0x52fd629 stdgpu::detail::dispatch_memcpy()
    @ 0x52fd7c2 stdgpu::detail::memcpy()
    @ 0x52e7ebf copyHost2DeviceArray<>()
    @ 0x52e7e83 stdgpu::atomic_ref<>::store()
    @ 0x52e56d9 stdgpu::atomic<>::store()
    @ 0x52edf64 stdgpu::detail::unordered_base<>::device_range<>()
    @ 0x52edf25 stdgpu::detail::unordered_base<>::device_range()
    @ 0x52ed9be stdgpu::unordered_set<>::device_range()
```
The whole pipeline looks like:
```cpp
insert_kernel<<<...>>>(xxx);                        // on stream
auto block_range = block_indices().device_range();  // a synchronous, blocking operation
update_block_meta_kernel<<<...>>>(xxx);             // on stream
```
But I have to admit that it is difficult to express this pipeline purely in terms of stream-ordered operations, and I do not know whether it can be achieved.
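The failure mode above can be reproduced without stdgpu: any legacy-default-stream operation issued while another blocking stream is capturing triggers the same error. A minimal sketch, where `dst`, `src`, and `bytes` are placeholders for valid buffers and a size:

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);  // a blocking (non-cudaStreamNonBlocking) stream

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Work enqueued on `stream` is captured fine:
// some_kernel<<<grid, block, 0, stream>>>(...);

// ... but a synchronous cudaMemcpy targets the legacy default stream and
// fails during capture with "operation would make the legacy stream depend
// on a capturing blocking stream".
cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);  // error while capturing

cudaGraph_t graph;
cudaStreamEndCapture(stream, &graph);
```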
Thanks. Even though the device_range() method comes with an overload that accepts an execution_policy, it internally needs an atomic whose load() and store() functions use a non-async memcpy. We probably need respective overloads for these as well to get full stream support here.
Sorry for the long delay. It took a larger refactoring to fill the gaps in the stream support, but with #450 this issue should be resolved.