CUDA stream pools

Open VidithM opened this issue 9 months ago • 2 comments

  • Initial work on stream pools for CUDA
  • Uses a global list of streams per device, and initializes them during GB_cuda_init
  • If a thread tries to acquire a stream and all of them are in use, it waits until one becomes available. Alternatively, we could return a soft error in this case (something like GxB_GPU_BUSY), so that the thread could switch to another GPU and try to acquire one of its streams.
  • I have not yet looked into what the performance changes are from this, but it seems to work correctly.
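To make the two options above concrete, here is a rough sketch (not the code in this PR). It compiles without the CUDA toolkit: `stream_t` is a stand-in for `cudaStream_t`, and `blocking_stream_pool`, `acquire_status`, and `acquire_on_any_device` are hypothetical names used only for illustration. `acquire_status::busy` plays the role of the suggested GxB_GPU_BUSY-style soft error, which does not exist yet.

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

// Stand-in for cudaStream_t so the sketch builds without CUDA (assumption).
using stream_t = int ;

// busy models a soft error such as the proposed GxB_GPU_BUSY (hypothetical).
enum class acquire_status { ok, busy } ;

// Hypothetical per-device pool, filled once at init time.
class blocking_stream_pool
{
    std::mutex mtx ;
    std::condition_variable cv ;
    std::vector<stream_t> free_streams ;

public:
    explicit blocking_stream_pool (int nstreams = 2)
    {
        // real code would call cudaStreamCreate for each entry
        for (int i = 0 ; i < nstreams ; i++) free_streams.push_back (i) ;
    }

    // Option 1: wait until some other thread releases a stream.
    stream_t acquire ()
    {
        std::unique_lock<std::mutex> lock (mtx) ;
        cv.wait (lock, [this] { return !free_streams.empty () ; }) ;
        stream_t s = free_streams.back () ;
        free_streams.pop_back () ;
        return s ;
    }

    // Option 2: never wait; report busy if no stream is free.
    acquire_status try_acquire (stream_t &s)
    {
        std::lock_guard<std::mutex> lock (mtx) ;
        if (free_streams.empty ()) return acquire_status::busy ;
        s = free_streams.back () ;
        free_streams.pop_back () ;
        return acquire_status::ok ;
    }

    // Return a stream to the pool and wake one waiting thread, if any.
    void release (stream_t s)
    {
        {
            std::lock_guard<std::mutex> lock (mtx) ;
            free_streams.push_back (s) ;
        }
        cv.notify_one () ;
    }
} ;

// With option 2, the caller can fall back to another device.
// Returns the device whose stream was acquired, or -1 if every pool is busy.
int acquire_on_any_device (blocking_stream_pool *pools, int ndevices,
    int preferred, stream_t &s)
{
    for (int k = 0 ; k < ndevices ; k++)
    {
        int device = (preferred + k) % ndevices ;
        if (pools [device].try_acquire (s) == acquire_status::ok)
        {
            return device ;
        }
    }
    return -1 ;
}
```

The trade-off is that option 1 keeps the caller simple but can stall a user thread indefinitely, while option 2 pushes a retry or fallback policy onto the caller.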

VidithM, Mar 14 '25 04:03

@DrTimothyAldenDavis I have addressed this with the changes from the 5/16 meeting:

  • User threads will no longer wait for a stream. If the pool is empty, a new stream will be created and returned. When releasing a stream, if the pool is full, the stream will be destroyed. Whether a stream was created at init time as part of the pool or at grab time can be transparent to the caller (i.e. there is no need to supply a flag as discussed in the meeting). As long as all pending work is complete on the stream, and it belongs to the correct device, it can be reused.
  • Removed the STL mutex; an OpenMP critical section is used instead.
  • The number of streams per device is now fixed at 32, and each per-device pool is a std::array.
  • Added GB_cuda_finalize and GB_cuda_stream_pool_finalize, which need to be triggered by GB_finalize.
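For reference, the grab/release policy described in this list can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: it models a single device, `stream_t` stands in for `cudaStream_t`, and `fake_stream_create` / `fake_stream_destroy` stand in for `cudaStreamCreate` / `cudaStreamDestroy` so that it builds without CUDA. All function and variable names here are hypothetical. Without `-fopenmp` the pragmas are ignored and the code runs single-threaded.

```cpp
#include <array>
#include <cstddef>

using stream_t = int ;                   // stand-in for cudaStream_t (assumption)
constexpr std::size_t POOL_SIZE = 32 ;   // streams per device, as stated above

static int streams_created = 0 ;         // counters exist only to observe the sketch
static int streams_destroyed = 0 ;

// Stand-ins for cudaStreamCreate / cudaStreamDestroy (hypothetical helpers).
static stream_t fake_stream_create () { return streams_created++ ; }
static void fake_stream_destroy (stream_t) { streams_destroyed++ ; }

// Pool for one device; real code would keep one of these per device.
struct device_pool
{
    std::array<stream_t, POOL_SIZE> streams {} ;
    std::size_t navail = 0 ;             // number of idle streams in the pool
} ;

static device_pool pool ;

// Fill the pool once, at init time.
void pool_init ()
{
    #pragma omp critical (stream_pool_sketch)
    {
        for (pool.navail = 0 ; pool.navail < POOL_SIZE ; pool.navail++)
        {
            pool.streams [pool.navail] = fake_stream_create () ;
        }
    }
}

// Never waits: take an idle stream, or create a new one if the pool is empty.
stream_t stream_grab ()
{
    stream_t s = -1 ;
    #pragma omp critical (stream_pool_sketch)
    {
        if (pool.navail > 0) s = pool.streams [--pool.navail] ;
        else s = fake_stream_create () ;
    }
    return s ;
}

// The caller must ensure all pending work on s is complete and that s belongs
// to this device. The stream is kept if there is room, otherwise destroyed.
void stream_release (stream_t s)
{
    #pragma omp critical (stream_pool_sketch)
    {
        if (pool.navail < POOL_SIZE) pool.streams [pool.navail++] = s ;
        else fake_stream_destroy (s) ;
    }
}

// Destroy every idle stream still held by the pool.
void pool_finalize ()
{
    #pragma omp critical (stream_pool_sketch)
    {
        while (pool.navail > 0)
        {
            fake_stream_destroy (pool.streams [--pool.navail]) ;
        }
    }
}
```

Because release only checks for free capacity, a stream created at grab time is indistinguishable from one created at init time, which is why no flag is needed.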

Also re-synced with the latest dev2.

VidithM, May 16 '25 20:05

Fixed the issue. Yes, this should be good to merge in now.

VidithM, May 17 '25 03:05