Multikernel option for bdsqr optimization

Open tfalders opened this issue 1 year ago • 0 comments

As previously discussed, I have been experimenting with optimizing BDSQR by using multiple kernel launches, with device synchronizations to determine the iterative loop's stopping condition.

Broadly speaking, I have made the following changes:

- bdsqr_t2bQRstep and bdsqr_b2tQRstep have been combined into a single device function. This was intended to allow different split blocks to be processed in the same thread group, but that approach did not perform well. I've kept the combined function regardless.
- bdsqr_kernel has been divided into multiple kernels, with the iterative loop moved to the CPU:
  - bdsqr_compute determines the shift for the current split block and applies the QR step to D and E. If nv, nu, and nc are all less than the specified switch size, it also updates the singular vectors.
  - bdsqr_rotate updates the singular vectors using multiple thread groups.
  - bdsqr_update_endpoints updates the endpoints of each split block, and spawns new split blocks if zeroes are found in the middle of a split block.
- bdsqr_chk_completed is a new kernel that determines whether the stopping criterion has been met for each batch instance.
- The work and splits_map arrays have been expanded to accommodate new information, and a new completed array has been added to hold information about problem status. See the comment block on line 316 for an explanation of how data is stored in these arrays.
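
To make the control flow described above easier to picture, here is a minimal host-only C++ sketch of a CPU-driven loop. It is not the code from this change. The `toy_*` functions, the `Instance` struct, and the `Block` layout are hypothetical stand-ins for the kernels and arrays listed above, and the "compute" step simply shrinks E rather than performing a real shifted QR step.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical layout: a split block is [first, last], inclusive indices into D.
using Block = std::pair<int, int>;

// Hypothetical per-batch-instance state (not the real work/splits_map layout).
struct Instance
{
    std::vector<double> E;     // superdiagonal; E[i] couples D[i] and D[i+1]
    std::vector<Block> blocks; // split blocks that still need work
    bool completed = false;    // stand-in for an entry of the completed array
};

// Stand-in for bdsqr_compute. NOT a QR step: it only shrinks E so the loop
// terminates, and flushes entries below tol to zero.
void toy_compute(Instance& inst, double tol)
{
    for(const Block& b : inst.blocks)
        for(int i = b.first; i < b.second; ++i)
        {
            inst.E[i] *= 0.25;
            if(std::fabs(inst.E[i]) < tol)
                inst.E[i] = 0.0;
        }
}

// Stand-in for bdsqr_update_endpoints: split each block wherever E is zero
// and drop sub-blocks that contain a single diagonal entry.
void toy_update_endpoints(Instance& inst)
{
    std::vector<Block> next;
    for(const Block& b : inst.blocks)
    {
        assert(b.first <= b.second);
        int start = b.first;
        for(int i = b.first; i < b.second; ++i)
        {
            if(inst.E[i] == 0.0)
            {
                if(i > start)
                    next.emplace_back(start, i); // block ends just before the zero
                start = i + 1;
            }
        }
        if(b.second > start)
            next.emplace_back(start, b.second);
    }
    inst.blocks = std::move(next);
}

// Stand-in for bdsqr_chk_completed: an instance is done when no blocks remain.
void toy_chk_completed(Instance& inst)
{
    inst.completed = inst.blocks.empty();
}

// Host loop: check the stopping condition, then run one more round of work.
// Returns the number of rounds executed.
int run_host_loop(std::vector<Instance>& batch, int max_iter, double tol)
{
    int iter = 0;
    for(; iter < max_iter; ++iter)
    {
        bool all_done = true;
        for(Instance& inst : batch)
        {
            toy_chk_completed(inst);
            all_done = all_done && inst.completed;
        }
        if(all_done)
            break; // every batch instance has met the stopping criterion
        for(Instance& inst : batch)
        {
            if(inst.completed)
                continue;
            toy_compute(inst, tol);
            toy_update_endpoints(inst);
        }
    }
    return iter;
}
```

In a GPU version, each inner loop over the batch would presumably be a kernel launch, and the `all_done` test would require a device synchronization before the host can read the completion status. That per-iteration synchronization is the cost this design accepts in exchange for moving the loop to the CPU.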

May 09 '24 00:05 tfalders