GraphBLAS Make row/colscale, select_bitmap more memory-friendly for CUDA

Defer the memcpys from the input matrix to the output matrix in row/colscale and select_bitmap until we know whether CUDA is being used. If it is, do the copy inside the kernel instead.
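
To illustrate doing the copy inside the kernel, here is a minimal host-side C++ sketch of a fused copy-and-scale loop. It is not the code in this PR: `SparseSketch` and `rowscale_fused` are hypothetical names, the matrix is reduced to one index array and one value array, and a plain loop stands in for the CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified sparse matrix: only the arrays this sketch needs.
struct SparseSketch {
    std::vector<int64_t> i;   // row index of each entry
    std::vector<double>  x;   // value of each entry
};

// Row scaling C = D*B with D diagonal, so C(i,j) = d[i] * B(i,j).
// The pattern copy (C.i = B.i) happens in the same loop that computes the
// values, instead of as a separate memcpy before the loop.
inline void rowscale_fused(SparseSketch &C,
                           const std::vector<double> &d,
                           const SparseSketch &B)
{
    assert(B.i.size() == B.x.size());
    const std::size_t nvals = B.x.size();
    C.i.resize(nvals);
    C.x.resize(nvals);
    for (std::size_t p = 0; p < nvals; p++) {
        const int64_t row = B.i[p];
        C.i[p] = row;                                        // deferred copy
        C.x[p] = d[static_cast<std::size_t>(row)] * B.x[p];  // scale
    }
}
```

In a CUDA version of this idea, the loop body would be executed by the kernel's threads, so the input arrays are read only by the code that also writes the output.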

Mar 07 '25 06:03 VidithM

Oops... my last comment (just deleted) was a reference to LAGraph, not GraphBLAS ... I got mixed up on which package I was looking at.

Mar 07 '25 13:03 DrTimothyAldenDavis

Looks great overall, just some minor comments and tweaks above.

Mar 07 '25 13:03 DrTimothyAldenDavis

GB_cuda_ek_slice needs some work. It assumes the input matrix is always the "A" matrix, and it has a specific type for the Ap array it's using: GB_Ap_TYPE. It's not GB_Aj_TYPE so that will break if Ap and Aj have different types. That's one reason why I'm still using plain int64_t for scalars, like the return value of GB_cuda_ek_slice_entry. We probably should use C++ templates here so I can call GB_cuda_ek_slice for other matrices, like "B".
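
A rough sketch of what the template idea could look like, in plain C++. This is not the actual GB_cuda_ek_slice code: `find_vector_sketch` is a hypothetical helper that assumes the lookup is a binary search over a pointer array `Xp` with `Xp[0] <= p < Xp[nvec]`. The offset type is a template parameter, so the same function works for any matrix, while the scalar result stays a plain int64_t.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the GraphBLAS function: return the vector k with
// Xp[k] <= p < Xp[k+1]. OffsetT is deduced from the array passed in, so the
// caller is not tied to the integer type of the "A" matrix.
template <typename OffsetT>
inline int64_t find_vector_sketch(const OffsetT *Xp, int64_t nvec, int64_t p)
{
    int64_t lo = 0, hi = nvec;   // invariant: Xp[lo] <= p < Xp[hi]
    while (hi - lo > 1) {
        const int64_t mid = lo + (hi - lo) / 2;
        if (static_cast<int64_t>(Xp[mid]) <= p) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```

Calling it with a `uint32_t` pointer array for one matrix and a `uint64_t` pointer array for another instantiates two versions of the same code, which is the behavior a single fixed GB_Ap_TYPE cannot provide.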

Mar 07 '25 13:03 DrTimothyAldenDavis

I see CUDA/template/GB_cuda_jit_AxB_dot3_dense_phase1.cuh is broken. I'm calling the ek_slice methods on the M matrix but internally I use the "A" matrix integers.

Mar 07 '25 13:03 DrTimothyAldenDavis

This should be good now. I'll make a follow-up PR to add flags to GB_dup_worker specifying which arrays to dup.

Mar 07 '25 19:03 VidithM

Addressed the feedback and synced with the latest dev2 changes.

Mar 10 '25 19:03 VidithM