
Run all stdgpu operations on a specified CUDA stream

Open tanzby opened this issue 1 year ago • 3 comments

I noticed that some functions such as stdgpu::detail::memcpy are non-async and run on the DEFAULT CUDA stream. More details: stdgpu::detail::memcpy depends on dispatch_memcpy, which looks like this:

void
dispatch_memcpy(void* destination,
                const void* source,
                index64_t bytes,
                dynamic_memory_type destination_type,
                dynamic_memory_type source_type)
{
    ...

    // Uses the default stream here.
    STDGPU_CUDA_SAFE_CALL(cudaMemcpy(destination, source, static_cast<std::size_t>(bytes), kind));
}

For example, if we use CUDA graphs and try to capture all operations on a stream, an error is raised because different streams (the default stream and the user's stream) get mixed.
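Roughly, the situation looks like this (a sketch with placeholder names, not my actual pipeline):

#include <cuda_runtime.h>

__global__ void my_kernel() { }

void capture_example()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin capturing work for a CUDA graph on a custom (blocking) stream.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    my_kernel<<<1, 1, 0, stream>>>();  // captured on 'stream'

    // Any stdgpu call that internally issues a synchronous cudaMemcpy() runs on
    // the legacy default stream, which must not synchronize with a capturing
    // blocking stream -> this is where the error below is raised.
    // some_stdgpu_operation();

    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);
    cudaStreamDestroy(stream);
}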

stdgpu : CUDA ERROR :
  Error     : operation would make the legacy stream depend on a capturing blocking stream
  File      : external/stdgpu/src/stdgpu/cuda/impl/memory.cpp:123
  Function  : void stdgpu::cuda::dispatch_memcpy(void *, const void *, stdgpu::index64_t, stdgpu::dynamic_memory_type, stdgpu::dynamic_memory_type)

So my request: run all stdgpu operations on a specified CUDA stream.
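Roughly, what I would hope for is something like this (just a sketch; such a stream parameter does not exist in stdgpu today):

void
dispatch_memcpy(void* destination,
                const void* source,
                index64_t bytes,
                dynamic_memory_type destination_type,
                dynamic_memory_type source_type,
                cudaStream_t stream)  // hypothetical parameter
{
    ...

    // Enqueue the copy asynchronously on the given stream instead of the
    // legacy default stream, so it can also be captured into a CUDA graph.
    STDGPU_CUDA_SAFE_CALL(cudaMemcpyAsync(destination,
                                          source,
                                          static_cast<std::size_t>(bytes),
                                          kind,
                                          stream));
}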

tanzby avatar Jun 21 '24 03:06 tanzby

Most of the functionality should support custom CUDA streams by taking a respective execution_policy which wraps the stream, see #351. Part of the memory API is a notable exception though, but the memcpy-like functions are not actually used in the containers. Could you provide some pointers to a particular function in stdgpu that triggers this error when called? Does it already happen when you only create a new container, e.g. auto c = stdgpu::vector<int>::createDeviceObject(1000);?
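For reference, the general idea is to bind the stream to a (Thrust-based) execution policy and pass that policy to the overloads which accept one; roughly (sketch only, see #351 for the exact interface):

#include <thrust/system/cuda/execution_policy.h>

cudaStream_t stream;
cudaStreamCreate(&stream);

// Thrust execution policy bound to the custom stream.
auto policy = thrust::cuda::par.on(stream);

// Container/algorithm calls that accept such a policy then run on 'stream',
// e.g. (hypothetical usage, check the actual overloads):
// auto c = stdgpu::vector<int>::createDeviceObject(policy, 1000);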

stotko avatar Jun 21 '24 11:06 stotko

@stotko Such as stdgpu::unordered_map<>::device_range, which is non-async:

Function  : void stdgpu::cuda::dispatch_memcpy(void *, const void *, stdgpu::index64_t, stdgpu::dynamic_memory_type, stdgpu::dynamic_memory_type)
    @          0x52ff5ac stdgpu::cuda::safe_call()
    @          0x52ff476 stdgpu::cuda::dispatch_memcpy()
    @          0x52fd629 stdgpu::detail::dispatch_memcpy()
    @          0x52fd7c2 stdgpu::detail::memcpy()
    @          0x52e7ebf copyHost2DeviceArray<>()
    @          0x52e7e83 stdgpu::atomic_ref<>::store()
    @          0x52e56d9 stdgpu::atomic<>::store()
    @          0x52edf64 stdgpu::detail::unordered_base<>::device_range<>()
    @          0x52edf25 stdgpu::detail::unordered_base<>::device_range()
    @          0x52ed9be stdgpu::unordered_set<>::device_range()

The whole pipeline looks like this:

insert_kernel<<<>>>(xxx);                                // on stream
auto block_range = block_indices().device_range();      // synchronous, blocking operation
update_block_meta_kernel<<<>>>(xxx);                     // on stream

But I have to admit that this is difficult to express purely in terms of stream operations, and I don't know whether it can be achieved.

tanzby avatar Jun 23 '24 16:06 tanzby

Thanks. Even though the device_range() method comes with an overload that accepts an execution_policy, it internally needs an atomic whose load() and store() functions use the non-async memcpy. We probably need respective overloads for these as well to get full stream support here.
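Conceptually, the missing piece would be something along these lines (hypothetical sketch, not the final design):

#include <cuda_runtime.h>

// Hypothetical helper: copy a single value from device to host on a
// caller-provided stream, as a stream-aware atomic load() would have to do
// instead of going through the synchronous dispatch_memcpy().
template <typename T>
T
load_on_stream(const T* device_value, cudaStream_t stream)
{
    T host_value;
    cudaMemcpyAsync(&host_value, device_value, sizeof(T),
                    cudaMemcpyDeviceToHost, stream);
    // The host still needs the value before returning, so synchronize the
    // stream (rather than the whole device); inside graph capture even this
    // would not be allowed, so the value would have to stay on the device.
    cudaStreamSynchronize(stream);
    return host_value;
}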

stotko avatar Jun 25 '24 06:06 stotko

Sorry for the long delay. It took a larger refactoring to fill the gaps in the stream support, but with #450 this issue should be resolved.

stotko avatar Nov 20 '24 11:11 stotko