[LAPACK][CUSOLVER] Add potrf and getrs batch functions to cuSolver
Description
This PR extends potrf_batch to additional overloads. Additionally, it implements all getrs_batch overloads. This change fully implements both potrf_batch and getrs_batch for USM and buffer APIs, for both group and strided batch operations.
In neither case does cuSolver have a direct equivalent to the oneMKL function, so in both cases some manipulation has to occur to make things work. potrf_batch implements the strided batch operation with the cuSolver group batch operation. This should be quite efficient, as it is just reformatting the input. getrs_batch is implemented using non-batched functions, looping over the cuSolver function for each matrix in the batch (sketched below). The performance of this would not be significantly different from the user looping over the oneMKL function themselves. I believe this is acceptable, as it still provides additional convenience and portability between backends.
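As a rough illustration of the two approaches, here is a minimal sketch assuming double precision, device-resident data, and an existing cuSolver handle; the helper names (potrf_strided_via_group, getrs_strided_loop) and the per-matrix info handling are illustrative only, not the actual backend code:

```cpp
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>

// Strided potrf_batch expressed through cuSolver's group (pointer-array) API:
// build an array of per-matrix device pointers from the strided allocation,
// then issue a single batched call.
void potrf_strided_via_group(cusolverDnHandle_t handle, cublasFillMode_t uplo,
                             int n, double *a, int lda, long long stride_a,
                             int batch_size, int *dev_info) {
    std::vector<double *> host_ptrs(batch_size);
    for (int i = 0; i < batch_size; ++i)
        host_ptrs[i] = a + i * stride_a;

    double **dev_ptrs = nullptr;
    cudaMalloc(reinterpret_cast<void **>(&dev_ptrs),
               batch_size * sizeof(double *));
    cudaMemcpy(dev_ptrs, host_ptrs.data(), batch_size * sizeof(double *),
               cudaMemcpyHostToDevice);

    // One batched call; cuSolver factorizes every matrix in the group.
    cusolverDnDpotrfBatched(handle, uplo, n, dev_ptrs, lda, dev_info,
                            batch_size);

    cudaFree(dev_ptrs);
}

// getrs has no batched cuSolver equivalent, so each system in the batch is
// solved with the non-batched routine in a loop.
void getrs_strided_loop(cusolverDnHandle_t handle, cublasOperation_t trans,
                        int n, int nrhs, const double *a, int lda,
                        long long stride_a, const int *ipiv,
                        long long stride_ipiv, double *b, int ldb,
                        long long stride_b, int batch_size, int *dev_info) {
    for (int i = 0; i < batch_size; ++i) {
        cusolverDnDgetrs(handle, trans, n, nrhs, a + i * stride_a, lda,
                         ipiv + i * stride_ipiv, b + i * stride_b, ldb,
                         dev_info + i);
    }
}
```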
The motivation for this change is to add greater functionality to the cuSolver backend and to improve portability between cuSolver and other LAPACK backends. Additionally, this provides a framework for further implementations of missing batch functions which do not align well with cuSolver functions.
Edit: A follow-up commit implements the remaining batched functions, with the exception of the getrf group batched overload, due to an issue with the test.
Test logs: potrf_and_getrs_results.txt
The remaining batch functions are implemented using the same method as the potrf and getrs functions.
The getrf group batched operation is commented out and set to unimplemented because its test segfaults. The error appears to be related not to oneMKL but to the reference LAPACK; further investigation will be done.
Test results: Some non-batched tests are failing locally; this occurs both with and without this patch. test_result.txt
Thanks for the PR @AidanBeltonS
Can you provide more details on the non-batch failures? Setting the environment variable CTEST_OUTPUT_ON_FAILURE=1 with ctest can provide more details. Unusual errors are sometimes due to linking against Netlib libraries compiled with 32-bit integers; can you verify your reference libraries were compiled for 64-bit integers?
I did see some batch failures relating to illegal memory accesses and invalid parameters passed to CUDA. Log attached here. I didn't see anything immediately wrong in the parameter passing, but I'll continue to look.
I have attached the results below with ctest output on failures. I have double checked I have built the netlib libraries for 64bits and am linking with the correct ones.
Thanks for the log. I will look into the failing tests.
@ericlars I found the issue with getrf_batch_group. The problem was with my implementation, which was causing an odd failure in the reference LAPACK check. This has been resolved.
I have not been able to reproduce the errors you have found; however, I have looked further into the issue of the non-batched operations failing. The problem appears to be the addition of multiple CUDA streams per SYCL queue. This likely interferes with the cuSolver scope handler, which assumes there is one stream per device per thread. It may be that the batch test failures you are seeing are also caused by the multi-stream changes. Would you be able to re-run the tests on a commit before the addition of multiple streams per queue? The DPC++ commit just before multi-streams is: d149ec39e7791a2d70858a7cf10261d6353b01be
I have attached the test logs below. Commit: d149ec39e7791a2d70858a7cf10261d6353b01be results-no-multistream.txt
Commit: dd418459868a976cd2eeae367fea6b92795ea611 results-with-multistream.txt
@AidanBeltonS My logs were from a build of llvm from 3/19, and it looks like the multistream patch was committed on 5/17, but for sanity I'll try building from your suggested commits. Thanks for the sleuthing.
New failures confirmed as resulting from multi-streams, although it is really a bug in oneMKL because interop streams are not synchronized. Proposed fix here: https://github.com/oneapi-src/oneMKL/pull/215
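For context, here is a hedged sketch of the idea behind that kind of fix, not the actual #215 patch: with several native CUDA streams per SYCL queue, the stream obtained through interop is not guaranteed to be the one earlier work ran on, so it needs to be synchronized before cached cuSolver state reuses it. The function name launch_cusolver_task and the surrounding structure are illustrative assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <cuda.h>
#include <cusolverDn.h>

void launch_cusolver_task(sycl::queue &q, cusolverDnHandle_t handle) {
    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            // The native stream handed back here may differ from the one
            // previous SYCL work was submitted on, so synchronize it before
            // binding it to the cached cuSolver handle.
            CUstream native =
                ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
            cuStreamSynchronize(native);
            cusolverDnSetStream(handle,
                                reinterpret_cast<cudaStream_t>(native));
            // ... enqueue the cuSolver routine on this stream ...
        });
    });
}
```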
@ericlars now that https://github.com/oneapi-src/oneMKL/pull/215 has been merged, the issues resulting from the multiple CUDA stream implementation in DPC++ should be resolved. Would you be able to review this again?
I have updated this PR to use the new functionality introduced in #215
@ericlars it looks like we have approvals for this. Could this be merged now, or are there any further discussions which need to be resolved?