[LAPACK][CUSOLVER] Add potrf and getrs batch functions to cuSolver
Description
This PR extends potrf_batch to additional overloads. Additionally, it implements all getrs_batch overloads. This change fully implements both potrf_batch and getrs_batch for USM and buffer APIs, for both group and strided batch operations.
In neither case does cuSolver have a direct equivalent to the oneMKL function, so in both cases some manipulation has to occur to make things work. potrf_batch implements the strided batch operation with the cuSolver group batch operation. This should be quite efficient, as it is just reformatting the input. getrs_batch is implemented using non-batched functions, looping over the cuSolver function for each matrix in the batch (sketched below). The performance of this would not be significantly different from the user looping over the oneMKL function themselves. I believe this is acceptable, as it still provides additional convenience and portability between backends.
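As a rough illustration of the two approaches, here is a minimal sketch assuming double precision, device-resident data, and an existing cuSolver handle; the helper names (potrf_strided_via_group, getrs_strided_loop) and the per-matrix info handling are illustrative only, not the actual backend code:

```cpp
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>

// Strided potrf_batch expressed through cuSolver's group (pointer-array) API:
// build an array of per-matrix device pointers from the strided allocation,
// then issue a single batched call.
void potrf_strided_via_group(cusolverDnHandle_t handle, cublasFillMode_t uplo,
                             int n, double *a, int lda, long long stride_a,
                             int batch_size, int *dev_info) {
    std::vector<double *> host_ptrs(batch_size);
    for (int i = 0; i < batch_size; ++i)
        host_ptrs[i] = a + i * stride_a;

    double **dev_ptrs = nullptr;
    cudaMalloc(reinterpret_cast<void **>(&dev_ptrs),
               batch_size * sizeof(double *));
    cudaMemcpy(dev_ptrs, host_ptrs.data(), batch_size * sizeof(double *),
               cudaMemcpyHostToDevice);

    // One batched call; cuSolver factorizes every matrix in the group.
    cusolverDnDpotrfBatched(handle, uplo, n, dev_ptrs, lda, dev_info,
                            batch_size);

    cudaFree(dev_ptrs);
}

// getrs has no batched cuSolver equivalent, so each system in the batch is
// solved with the non-batched routine in a loop.
void getrs_strided_loop(cusolverDnHandle_t handle, cublasOperation_t trans,
                        int n, int nrhs, const double *a, int lda,
                        long long stride_a, const int *ipiv,
                        long long stride_ipiv, double *b, int ldb,
                        long long stride_b, int batch_size, int *dev_info) {
    for (int i = 0; i < batch_size; ++i) {
        cusolverDnDgetrs(handle, trans, n, nrhs, a + i * stride_a, lda,
                         ipiv + i * stride_ipiv, b + i * stride_b, ldb,
                         dev_info + i);
    }
}
```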
The motivation for this change is to add greater functionality to the cuSolver backend and to improve portability between cuSolver and other LAPACK backends. Additionally, this provides a framework for further implementations of missing batch functions which do not align well with cuSolver functions.
Edit: A follow-up commit implements the remaining batched functions, with the exception of the getrf group batched overload, due to an issue with the test.
Test logs: potrf_and_getrs_results.txt
The remaining batch functions are implemented using the same method as the potrf and getrs functions.
The getrf group batched operation is commented out and set to unimplemented because its test segfaults. The error appears to be related not to oneMKL but to the reference LAPACK; further investigation will be done.
Test results: Some non-batched tests are failing locally; this occurs both with and without this patch. test_result.txt
Thanks for the PR @AidanBeltonS
Can you provide more details on the non-batch failures? Setting the environment variable CTEST_OUTPUT_ON_FAILURE=1 with ctest can provide more details. Unusual errors are sometimes due to linking against Netlib libraries compiled with 32-bit integers; can you verify your reference libraries were compiled for 64-bit integers?
I did see some batch failures relating to illegal memory accesses and invalid parameters passed to CUDA. Log attached here. I didn't see anything immediately wrong in the parameter passing, but I'll continue to look.
I have attached the results below with ctest output on failures. I have double checked I have built the netlib libraries for 64bits and am linking with the correct ones.
Thanks for the log. I will look into the failing tests.
@ericlars I found the issue with getrf_batch_group. The problem was with my implementation, which was causing an odd failure in the reference LAPACK check. This has been resolved.
I have not been able to reproduce the errors you have found; however, I have looked further into the issue of the non-batched operations failing. The problem appears to be the addition of multiple CUDA streams per SYCL queue. This likely interferes with the cuSolver scope handler, which assumes there is one stream per device per thread. It may be that the batch test failures you are seeing are also caused by the multi-stream changes. Would you be able to re-run the tests on a commit before the addition of multiple streams per queue? The DPC++ commit just before multi-streams is: d149ec39e7791a2d70858a7cf10261d6353b01be
I have attached the test logs below. Commit: d149ec39e7791a2d70858a7cf10261d6353b01be results-no-multistream.txt
Commit: dd418459868a976cd2eeae367fea6b92795ea611 results-with-multistream.txt
@AidanBeltonS My logs were from a build of llvm from 3/19, and it looks like the multistream patch was committed on 5/17, but for sanity I'll try building from your suggested commits. Thanks for the sleuthing.
New failures confirmed as resulting from multi-streams, although it is really a bug in oneMKL because interop streams are not synchronized. Proposed fix here: https://github.com/oneapi-src/oneMKL/pull/215
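For context, here is a hedged sketch of the idea behind that kind of fix, not the actual #215 patch: with several native CUDA streams per SYCL queue, the stream obtained through interop is not guaranteed to be the one earlier work ran on, so it needs to be synchronized before cached cuSolver state reuses it. The function name launch_cusolver_task and the surrounding structure are illustrative assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <cuda.h>
#include <cusolverDn.h>

void launch_cusolver_task(sycl::queue &q, cusolverDnHandle_t handle) {
    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            // The native stream handed back here may differ from the one
            // previous SYCL work was submitted on, so synchronize it before
            // binding it to the cached cuSolver handle.
            CUstream native =
                ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
            cuStreamSynchronize(native);
            cusolverDnSetStream(handle,
                                reinterpret_cast<cudaStream_t>(native));
            // ... enqueue the cuSolver routine on this stream ...
        });
    });
}
```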
@ericlars now that https://github.com/oneapi-src/oneMKL/pull/215 has been merged, the issues resulting from the multiple CUDA stream implementation in DPC++ should be resolved. Would you be able to review this again?
I have updated this PR to use the new functionality introduced in #215
@ericlars it looks like we have approvals for this. Could this be merged now, or are there any further discussions which need to be resolved?