rocSOLVER
Multikernel option for bdsqr optimization
As previously discussed, I have been experimenting with optimizing BDSQR by using multiple kernel launches, with device synchronizations to determine the iterative loop's stopping condition.
Broadly speaking, I have made the following changes:
- `bdsqr_t2bQRstep` and `bdsqr_b2tQRstep` have been combined into a single device function. This was intended to allow different split blocks to be processed in the same thread group, but it wasn't performant. Still, I've kept this change.
- `bdsqr_kernel` has been divided into multiple different kernels, with the iterative loop moved to the CPU.
  - `bdsqr_compute` determines the shift for the current split block and applies the QR step to D and E. If nv, nu, and nc are less than the specified switch size, it will also update the singular vectors.
  - `bdsqr_rotate` updates the singular vectors using multiple thread groups.
  - `bdsqr_update_endpoints` updates the endpoints of each split block, and spawns new split blocks if zeroes are found in the middle of a split block.
  - `bdsqr_chk_completed` is a new kernel that will determine if the stopping criterion has been met for each batch instance.
- The `work` and `splits_map` arrays have been expanded to accommodate new information, and a new `completed` array has been added to hold information about problem status. See the comment block on line 316 for an explanation of how data is stored in these arrays.
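
As a rough illustration of the control flow described above (not the actual rocSOLVER code), the host-driven loop can be sketched in plain C++. Everything here is a stand-in: `Instance`, `sweep_stub`, `check_completed_stub`, and `run_host_loop` are hypothetical names, the "kernels" are ordinary CPU functions, and halving the off-diagonal entries merely imitates convergence. It is not a QR step. The per-instance `completed` flag only mirrors the idea of the new `completed` array, not its real layout.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-instance state. This is NOT how rocSOLVER lays out
// its work, splits_map, or completed arrays.
struct Instance {
    std::vector<double> e; // stand-in for the off-diagonal entries E
    int completed = 0;     // 1 once the stopping criterion is met
};

// Stand-in for the compute/rotate/update-endpoints launches: one pass over
// every unfinished batch instance. Halving each entry imitates convergence.
void sweep_stub(std::vector<Instance>& batch, double tol) {
    for (Instance& inst : batch) {
        if (inst.completed) continue; // finished instances do no further work
        for (double& x : inst.e) {
            x *= 0.5;
            if (std::abs(x) < tol) x = 0.0; // treat tiny entries as zero
        }
    }
}

// Stand-in for a completion-check launch: an instance is done when all of
// its off-diagonal entries are exactly zero.
void check_completed_stub(std::vector<Instance>& batch) {
    for (Instance& inst : batch) {
        bool all_zero = true;
        for (double x : inst.e) {
            if (x != 0.0) all_zero = false;
        }
        inst.completed = all_zero ? 1 : 0;
    }
}

// Host-side iterative loop. In a GPU version, a device synchronization
// would be needed before the host reads the completion flags.
// Returns the number of passes that were executed.
int run_host_loop(std::vector<Instance>& batch, double tol, int max_iters) {
    int iters = 0;
    while (iters < max_iters) {
        sweep_stub(batch, tol);
        check_completed_stub(batch);
        ++iters;

        bool all_done = true;
        for (const Instance& inst : batch) {
            if (!inst.completed) all_done = false;
        }
        if (all_done) break; // every batch instance met the stopping criterion
    }
    return iters;
}
```

The point of the sketch is only that the loop keeps running until the slowest batch instance finishes, so instances that converge early sit idle for the remaining passes, and each pass costs one round trip to read the flags.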