ginkgo
Simplify GMRES kernels
This PR separates the `step1` and `initialize2` kernels into individual reductions (norm and dot) and axpy/scale operations, which allows us to use the simple kernel setup for all of GMRES. This will also simplify adding CGS-Arnoldi to plain GMRES (and, later on, to distributed GMRES).
The branch is based on simple_kernel_reduction, which is why the changes are a bit obscured. But the base isn't strictly necessary, so I could remove it.
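To illustrate the split described above, here is a minimal sketch of one Arnoldi step written only in terms of separate dot/norm reductions and axpy/scale updates. It is not the Ginkgo kernel code: the names `dot`, `norm2`, `axpy`, `scale` and `arnoldi_step` are hypothetical, it uses plain `std::vector<double>` on the host, and it assumes modified Gram-Schmidt orthogonalization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Reduction: dot product of two vectors of equal length.
double dot(const Vec& a, const Vec& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Reduction: Euclidean norm, built on top of the dot product.
double norm2(const Vec& a) { return std::sqrt(dot(a, a)); }

// Element-wise update: y += alpha * x.
void axpy(double alpha, const Vec& x, Vec& y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += alpha * x[i];
    }
}

// Element-wise update: x *= alpha.
void scale(double alpha, Vec& x)
{
    for (auto& value : x) {
        value *= alpha;
    }
}

// One Arnoldi step (modified Gram-Schmidt, an assumption of this sketch).
// `basis` holds the orthonormal Krylov vectors computed so far.
// `w` is A * basis.back() on entry and the next basis vector on exit
// (left unnormalized if its norm is zero).
// Returns the new Hessenberg column of length basis.size() + 1.
Vec arnoldi_step(const std::vector<Vec>& basis, Vec& w)
{
    Vec h(basis.size() + 1, 0.0);
    for (std::size_t j = 0; j < basis.size(); ++j) {
        h[j] = dot(basis[j], w);   // reduction
        axpy(-h[j], basis[j], w);  // element-wise update
    }
    h.back() = norm2(w);           // reduction
    if (h.back() > 0.0) {
        scale(1.0 / h.back(), w);  // element-wise update
    }
    return h;
}
```

Because every reduction is its own call, a classical Gram-Schmidt variant would only need to reorder these calls (all dots first, then all axpys), and a distributed variant would only need to swap in global reductions.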
TODO:
- [ ] Add reference kernel tests
- [x] Fix DPC++
- [x] Fix complex GMRES (since we don't have a real * complex scal operation yet, see #864)
- [ ] Test more edge cases (stopping, finalized, large discrepancy between convergence speeds... across restarts)
- [ ] ~CB-GMRES (this one will need simple kernel reductions though!)~
rebase!
Error: The following files need to be formatted:
common/unified/solver/gmres_kernels.cpp
hip/base/kernel_launch.hip.hpp
reference/solver/gmres_kernels.cpp
test/solver/gmres_kernels.cpp
You can find a formatting patch under Artifacts here or run `format!` if you have write access to Ginkgo
I am not 100% happy with the way I deal with the complex `next_krylov` norm computation (currently, I need additional storage for that). I am looking into alternative ways, but the rest of the PR should be in a good state.
format-rebase!
Error: Rebase failed, see the related Action for details
Note: This PR changes the Ginkgo ABI:
Functions changes summary: 0 Removed, 0 Changed (16 filtered out), 6 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
For details check the full ABI diff under Artifacts here
format-rebase!
Formatting rebase introduced changes, see Artifacts here to review them
@pratikvn I'm not so sure if the distributed case will be significantly different. I think you can just duplicate the hessenberg matrix on each process and then just do the hessenberg solve, etc. locally. The issue before was that the scalar products were done within the kernels, so we had no chance to use the distributed scalar products.
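As a rough sketch of what "do the Hessenberg solve locally" could look like: if every process holds the same Hessenberg column (because its entries come from globally reduced scalar products), each process can run the same small update without any communication. The code below assumes the least-squares problem is handled with Givens rotations, which is the usual GMRES approach but is not confirmed by this PR; `GivensState` and `update_hessenberg_column` are hypothetical names, and only real values are handled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Replicated on every process: rotations applied so far and the rotated
// right-hand side. g[0] starts as the initial residual norm.
struct GivensState {
    std::vector<double> cs;  // cosines of previous rotations
    std::vector<double> sn;  // sines of previous rotations
    std::vector<double> g;   // rotated right-hand side
};

// Applies the k previous rotations to the new Hessenberg column `h`
// (length k + 2), builds a new rotation that zeroes h[k + 1], and updates
// the right-hand side. Returns the current residual norm estimate.
// Purely local: no reductions or communication are needed here.
double update_hessenberg_column(GivensState& st, std::vector<double>& h)
{
    const std::size_t k = st.cs.size();
    assert(h.size() == k + 2);
    assert(st.g.size() == k + 1);

    // Apply the previously computed rotations to the new column.
    for (std::size_t i = 0; i < k; ++i) {
        const double upper = st.cs[i] * h[i] + st.sn[i] * h[i + 1];
        h[i + 1] = -st.sn[i] * h[i] + st.cs[i] * h[i + 1];
        h[i] = upper;
    }

    // Build the rotation that eliminates the subdiagonal entry.
    const double r = std::hypot(h[k], h[k + 1]);
    const double c = (r == 0.0) ? 1.0 : h[k] / r;
    const double s = (r == 0.0) ? 0.0 : h[k + 1] / r;
    h[k] = r;
    h[k + 1] = 0.0;
    st.cs.push_back(c);
    st.sn.push_back(s);

    // Rotate the right-hand side; the new last entry is the residual.
    const double old_gk = st.g[k];
    st.g[k] = c * old_gk;
    st.g.push_back(-s * old_gk);
    return std::abs(st.g[k + 1]);
}
```

Since the inputs are identical on every process, the outputs are too, so the stopping decision based on the returned residual estimate would be consistent across processes.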
I finally ran the benchmark and compared the GPU performance between the current `develop` and `gmres_simplify`.
I tested 8 matrices, and all of them show a speedup of approx. 1.1 or more (meaning the change in this PR actually speeds up GMRES on an A100). This implementation was never slower than what is currently in `develop`.
@yhmtsai I ran the same benchmark with 4 RHS:
| Matrix name | develop iters | develop time [s] | gmres_simplify iters | gmres_simplify time [s] |
|---|---|---|---|---|
| G3_circuit | 704 | 9.09 | 704 | 8.10 |
| t2em | 219 | 1.55 | 219 | 1.49 |
| circuit5M_dc | 42 | 0.65 | 42 | 0.55 |
| audikw_1 | 17169 | 196.05 | 12747 | 140.18 |
| Bump_2911 | 40862 | 1205.90 | 40368 | 1058.94 |
| ecology1 | 1219 | 9.81 | 1219 | 9.28 |
| ss | 1089 | 16.14 | 1089 | 14.50 |
| mc2depi | 3345 | 13.10 | 3294 | 13.78 |
Only `mc2depi` seems to be slower; the rest is faster with this PR.
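For anyone who wants to recompute the per-matrix speedups from the 4-RHS table, a small sketch follows. The timings are copied from the table above; the `Row` struct and the helper names are made up for illustration and are not part of the benchmark tooling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of the 4-RHS benchmark table (times in seconds).
struct Row {
    std::string matrix;
    double develop_s;
    double simplify_s;
};

// Timings copied from the table in this thread.
const std::vector<Row>& benchmark_rows()
{
    static const std::vector<Row> rows = {
        {"G3_circuit", 9.09, 8.10},      {"t2em", 1.55, 1.49},
        {"circuit5M_dc", 0.65, 0.55},    {"audikw_1", 196.05, 140.18},
        {"Bump_2911", 1205.90, 1058.94}, {"ecology1", 9.81, 9.28},
        {"ss", 16.14, 14.50},            {"mc2depi", 13.10, 13.78},
    };
    return rows;
}

// Speedup of gmres_simplify over develop; values above 1 mean the PR is
// faster.
double speedup(const Row& row)
{
    assert(row.simplify_s > 0.0);
    return row.develop_s / row.simplify_s;
}

// Names of all matrices for which the PR branch is slower than develop.
std::vector<std::string> slower_matrices(const std::vector<Row>& rows)
{
    std::vector<std::string> result;
    for (const auto& row : rows) {
        if (speedup(row) < 1.0) {
            result.push_back(row.matrix);
        }
    }
    return result;
}
```

Note that these ratios compare total runtimes, so for matrices where the iteration counts differ between the branches (such as `audikw_1`) they mix per-iteration cost with convergence behavior.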
Nice results, thanks Thomas.
format-rebase!
Formatting rebase introduced changes, see Artifacts here to review them
Note: This PR changes the Ginkgo ABI:
Functions changes summary: 0 Removed, 0 Changed (16 filtered out), 6 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
For details check the full ABI diff under Artifacts here
Error: PR already merged!