kokkos-kernels
kokkos-kernels copied to clipboard
Optimize TPL interface to rocsparse spmv by caching variables for repeated use
Caching the temp variables is key to performance of repeated SpMV. I tested on one Crusher node with petsc microbenchmark (src/mat/tests/ex1k).
Average SpMV time (us) with matrix HV15R (of size 2M x 2M).
1 rank | 8 ranks | |
---|---|---|
TPL=OFF | 3743 | 636 |
TPL=ON | 14948 | 3713 |
TPL=ON + this PR | 2686 | 571 |
@lucbv I hope to make A.roc_spmv_helper
a private member, but I don't know how to access it in spmv_rocsparse()
then.
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.
@jczhang07 Is this for several calls to SpMV? Is the caching basically avoiding some expensive allocations / initializations?
Yes, for repeated calls to SpMV as commonly seen in Krylov methods.
This makes sense although I would like to think about all the TPLs together and try to come up with an implementation that takes care of all of this for cusparse, rocsparse, onemkl, etc...
That sounds reasonable. But each TPL has its own APIs. In the end, we have to have separate implementations.
I suggest we could add standalone TPLs as we go, then we might come up with better ideas to consolidate them.
Correct, I think handling the TPLs can be done in two parts
- augment the CrsMatrix to hold additional data generated by the TPL
- keep the individual kernel implementations for each TPL
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.
@lucbv Testing pipeline is clean. Any comment on merging it?
@lucbv @jczhang07 Hi guys, is there an update on this? I have some overlapping changes that I would like to put in to help support rocsparse in Tpetra, but it would be nice to use this as a starting point (my existing work ignores cusparse).
@lucbv @jczhang07 Hi guys, is there an update on this? I have some overlapping changes that I would like to put in to help support rocsparse in Tpetra, but it would be nice to use this as a starting point (my existing work ignores cusparse).
@seanofthemillers On my end, I just need reviewers and approvals to get it merged :)
@jczhang07 @seanofthemillers We are okay with this PR in general but we would like to save the temporary data in the KokkosKernels handle or the KokkosKernels controls instead of having these cached into a new object. We may want to add rocsparse matrix descriptor directly to the CrsMatrix in a similar fashion to the current cusparse matrix descriptor.
I can start working on these changes or if one of you want to start that's also fine with me : )
@lucbv I had problems in my implementation with adding the raw descriptor to the CrsMatrix. In an application like Tpetra, these KokkosSparse::CrsMatrix objects are generated on the fly (i.e. not stored and re-used), so we would have an issue where the KokkosSparse::CrsMatrix would destroy the descriptor on destruction. The Tpetra design implies that the Tpetra::CrsMatrix should own the descriptor, but it would be nice to automate the generation of the descriptors via kokkos-kernels to keep raw rocsparse/cusparse code out of Tpetra. The shared_ptr method used in this PR gives us a nice workaround.
@seanofthemillers the shared pointer can be stored in the handle or controls of Kokkos Kernels then, although I have to say having Tpetra destroy and create the matrix on the fly is a but annoying even if I understand the idea. We are just not keen on creating a new data structure when we already have two for similar purpose in the library.
I think Tpetra might be destroying our controls and handles after performing the local kernel but that's really a problem for them to figure out.
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.
@lucbv I forgot this MR hadn't been merged yet until I tested a SpMV benchmark on Frontier recently and saw the terrible performance again (https://github.com/kokkos/kokkos-kernels/pull/1469#issue-1306297831).
I rebased the MR off KK/develop. Could you have a look and get it merged?
@brian-kelley Could you test this PR in CI?