GraphBLAS CUDA stream pools

Initial work on stream pools for CUDA.
Uses a global list of streams per device, initialized during GB_cuda_init.
If a thread tries to acquire a stream and all of them are in use, it waits until one becomes available. Alternatively, we could return a soft error in this case (something like GxB_GPU_BUSY), so that the thread could switch to another GPU and try to acquire one of its streams.
I have not yet measured the performance impact of this change, but it seems to work correctly.
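
For discussion, here is a rough sketch of the soft-error alternative described above. It is not the code in this PR. `Status::GPU_BUSY` stands in for the proposed GxB_GPU_BUSY, `Stream` is an integer stand-in for `cudaStream_t` so the sketch compiles without the CUDA toolkit, and `try_acquire`, `acquire_any`, `pools_init`, and the device and pool sizes are all hypothetical names and values chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical status codes; GPU_BUSY mirrors the "GxB_GPU_BUSY" idea.
enum class Status { OK, GPU_BUSY };

constexpr int kNumDevices = 2;        // assumed device count for this sketch
constexpr int kStreamsPerDevice = 2;  // assumed pool size for this sketch

using Stream = int;  // stand-in for cudaStream_t

struct DevicePool {
    std::mutex lock;                   // guards free_streams
    std::vector<Stream> free_streams;  // streams currently available
};

static std::array<DevicePool, kNumDevices> pools;

// Fill every per-device pool (the real code would create CUDA streams here).
void pools_init() {
    for (int d = 0; d < kNumDevices; ++d) {
        std::lock_guard<std::mutex> guard(pools[d].lock);
        pools[d].free_streams.clear();
        for (int s = 0; s < kStreamsPerDevice; ++s) {
            pools[d].free_streams.push_back(d * 100 + s);
        }
    }
}

// Non-blocking acquire: reports GPU_BUSY instead of waiting.
Status try_acquire(int device, Stream* out) {
    assert(device >= 0 && device < kNumDevices);
    DevicePool& pool = pools[device];
    std::lock_guard<std::mutex> guard(pool.lock);
    if (pool.free_streams.empty()) {
        return Status::GPU_BUSY;
    }
    *out = pool.free_streams.back();
    pool.free_streams.pop_back();
    return Status::OK;
}

// Try the preferred device first, then the others in round-robin order.
// Returns the device the stream came from, or -1 if every pool is drained.
int acquire_any(int preferred, Stream* out) {
    for (int i = 0; i < kNumDevices; ++i) {
        int d = (preferred + i) % kNumDevices;
        if (try_acquire(d, out) == Status::OK) {
            return d;
        }
    }
    return -1;
}
```

With this shape the caller decides what to do when every device is busy (retry, wait, or fall back to the CPU), rather than the pool blocking on its behalf.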

Mar 14 '25 04:03 VidithM

@DrTimothyAldenDavis addressed with changes from the 5/16 meeting:

User threads will no longer wait for a stream. If the pool is empty, a new stream will be created and returned. When releasing a stream, if the pool is full, the stream will be destroyed. Whether a stream was created at init time as part of the pool or at grab time can be transparent to the caller (i.e. there is no need to supply a flag as discussed in the meeting). As long as all pending work is complete on the stream, and it belongs to the correct device, it can be reused.
Removed the STL mutex; an OpenMP critical section is used instead.
The number of streams per device is now fixed at 32, with a std::array holding each per-device pool.
Added GB_cuda_finalize and GB_cuda_stream_pool_finalize, which need to be triggered by GB_finalize.
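
To make the acquire/release policy in the list above concrete, here is a minimal sketch. It is not the actual implementation. `Stream` is an integer stand-in for `cudaStream_t`, and `stream_create` / `stream_destroy` are hypothetical placeholders for where `cudaStreamCreate` / `cudaStreamDestroy` would go. The names `pool_init`, `pool_acquire`, `pool_release`, and `pool_finalize` and the device count are assumptions. Only the pool size of 32, the std::array storage, and the OpenMP critical section come from the description.

```cpp
#include <array>
#include <atomic>
#include <cassert>

constexpr int kPoolSize = 32;   // streams per device, as described above
constexpr int kMaxDevices = 4;  // assumed device count for this sketch

using Stream = int;  // stand-in for cudaStream_t

// Bookkeeping for the sketch only: how many streams currently exist.
static std::atomic<int> live_streams{0};
static std::atomic<int> next_id{0};

// Placeholders: real code would call cudaStreamCreate / cudaStreamDestroy.
Stream stream_create() { ++live_streams; return next_id++; }
void stream_destroy(Stream) { --live_streams; }

struct DevicePool {
    std::array<Stream, kPoolSize> slots{};
    int count = 0;  // number of valid entries in slots
};

static std::array<DevicePool, kMaxDevices> pools;

// Fill every per-device pool at init time.
void pool_init() {
    for (DevicePool& p : pools) {
        p.count = 0;
        while (p.count < kPoolSize) p.slots[p.count++] = stream_create();
    }
}

// Never waits: takes a pooled stream, or creates a new one if the pool is empty.
Stream pool_acquire(int device) {
    assert(device >= 0 && device < kMaxDevices);
    Stream s = -1;
    bool found = false;
    #pragma omp critical(stream_pool_sketch)
    {
        DevicePool& p = pools[device];
        if (p.count > 0) { s = p.slots[--p.count]; found = true; }
    }
    if (!found) s = stream_create();
    return s;
}

// Returns a stream to the pool, or destroys it if the pool is already full.
// The caller is assumed to have completed all pending work on the stream.
void pool_release(int device, Stream s) {
    assert(device >= 0 && device < kMaxDevices);
    bool stored = false;
    #pragma omp critical(stream_pool_sketch)
    {
        DevicePool& p = pools[device];
        if (p.count < kPoolSize) { p.slots[p.count++] = s; stored = true; }
    }
    if (!stored) stream_destroy(s);
}

// Destroy everything still in the pools (the finalize step).
void pool_finalize() {
    for (DevicePool& p : pools) {
        while (p.count > 0) stream_destroy(p.slots[--p.count]);
    }
}
```

Because a released stream is just pushed back or destroyed, the pool never needs to know whether it was created at init time or at grab time, which is why no flag is needed. Without OpenMP enabled the pragma is ignored and the sketch runs single-threaded.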

Also re-synced with the latest dev2.

May 16 '25 20:05 VidithM

Fixed the issue. Yes, this should be good to merge in now.

May 17 '25 03:05 VidithM