Make row/colscale, select_bitmap more memory-friendly for CUDA
- Defer memcpys from the input to output matrix in row/colscale, select_bitmap until we know whether we are using CUDA. If so, do the memcpy inside the kernel.
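A minimal host-side sketch of the idea, not the actual GraphBLAS code: instead of copying the input pattern into the output with a separate memcpy and then scaling, a single pass can do both. The function names and the assumption that the index array is what gets copied are hypothetical; a CUDA kernel would run the same loop body once per thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical illustration only; these names do not exist in GraphBLAS.

// Eager variant: memcpy the pattern first, then make a second pass to scale.
void rowscale_copy_then_scale(const double *Ax, const int64_t *Ai,
                              const double *d, double *Cx, int64_t *Ci,
                              int64_t nvals)
{
    assert(nvals >= 0);
    std::memcpy(Ci, Ai, static_cast<size_t>(nvals) * sizeof(int64_t));
    for (int64_t p = 0; p < nvals; ++p)
    {
        Cx[p] = d[Ai[p]] * Ax[p];   // C(i,j) = D(i,i) * A(i,j)
    }
}

// Deferred variant: the copy happens inside the same loop that scales,
// so the output pattern is written exactly once, by the "kernel" itself.
void rowscale_fused(const double *Ax, const int64_t *Ai,
                    const double *d, double *Cx, int64_t *Ci,
                    int64_t nvals)
{
    assert(nvals >= 0);
    for (int64_t p = 0; p < nvals; ++p)
    {
        const int64_t i = Ai[p];
        Ci[p] = i;                  // the copy that replaces the memcpy
        Cx[p] = d[i] * Ax[p];
    }
}
```

Both variants produce the same output; the fused one avoids a separate host-side copy when the work is going to run on the device anyway.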
Oops... my last comment (just deleted) was a reference to LAGraph, not GraphBLAS ... I got mixed up on which package I was looking at.
Looks great overall, just some minor comments and tweaks above.
GB_cuda_ek_slice needs some work. It assumes the input matrix is always the "A" matrix, and it hard-codes the type of the Ap array it uses: GB_Ap_TYPE. That's not GB_Aj_TYPE, so it will break if Ap and Aj have different types. That's one reason why I'm still using plain int64_t for scalars, like the return value of GB_cuda_ek_slice_entry. We should probably use C++ templates here so that GB_cuda_ek_slice can be called for other matrices, like "B".
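A rough sketch of what the templated approach could look like, assuming the core of the slice step is finding the vector k that contains entry p. The helper name `find_vector_of_entry` and its signature are hypothetical and are not the existing GB_cuda_ek_slice API; the point is only that the pointer type becomes a template parameter while the scalar result stays int64_t.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the GraphBLAS implementation.
// Returns the largest k with Xp[k] <= p, i.e. the vector holding entry p.
// Precondition: Xp[0] <= p < Xp[nvec]. Xp may be the pointer array of any
// matrix (A, B, M, ...) and P may be uint32_t, uint64_t, int64_t, etc.
template <typename P>
int64_t find_vector_of_entry(const P *Xp, int64_t nvec, int64_t p)
{
    assert(nvec > 0);
    int64_t lo = 0, hi = nvec;      // invariant: Xp[lo] <= p < Xp[hi]
    while (hi - lo > 1)
    {
        const int64_t mid = lo + (hi - lo) / 2;
        if (static_cast<int64_t>(Xp[mid]) <= p)
        {
            lo = mid;
        }
        else
        {
            hi = mid;
        }
    }
    return lo;
}
```

With this shape, a call such as `find_vector_of_entry(Mp, mnvec, p)` deduces the integer type from the array it is given, so the caller does not depend on the "A" matrix macros.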
I see CUDA/template/GB_cuda_jit_AxB_dot3_dense_phase1.cuh is broken. I'm calling the ek_slice methods on the M matrix, but internally they use the "A" matrix integer types.
This should be good now. I'll make a follow-up PR to add flags to GB_dup_worker specifying which arrays to dup.
Addressed feedback and synced with the latest dev2 changes