
LU decomposition vs. Matrix Inversion

Open · DenSto opened this issue Jun 27 '22 · 0 comments

Currently, stella uses an LU decomposition to perform the linear solve of the response matrix for the parallel streaming term. This isn't the only option, however; GS2 uses a full matrix inversion instead. There are pros and cons to both approaches: the LU decomposition itself uses about half the operations of Gauss-Jordan elimination, so initializing the problem is roughly twice as fast. The LU decomposition is also much more stable for ill-conditioned matrices (though I don't see why the parallel streaming response matrix would be particularly ill-conditioned, so this is less of a concern).
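For concreteness, here is a minimal NumPy/SciPy sketch of the two setup paths (purely illustrative, not stella code; the test matrix and its size are made up):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n = 512                                          # stands in for N_x * N_z
rng = np.random.default_rng(0)
R = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned stand-in
                                                 # for the response matrix

# Option 1 (current stella): LU factorization, ~(2/3) n^3 flops.
lu, piv = lu_factor(R)

# Option 2 (GS2-style): explicit inverse. More expensive setup, since it
# amounts to an LU factorization followed by solves against the identity.
R_inv = np.linalg.inv(R)

# Both give the same solve, but via different per-step operations.
b = rng.standard_normal(n)
x_lu = lu_solve((lu, piv), b)   # forward + backward substitution
x_inv = R_inv @ b               # single matrix-vector product
assert np.allclose(x_lu, x_inv)
```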

However, the full matrix inversion comes with one big advantage: the linear solve becomes a simple matrix multiplication rather than the forward and backward substitutions of the LU decomposition. This matters because this part of the code scales as (N_x*N_z)^2, so for big problems the solve can become dominant if no extra steps are taken.
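Both per-step operations cost O((N_x*N_z)^2), but their structure differs: forward/backward substitution carries a row-to-row dependency, while applying a precomputed inverse has none. A toy sketch of the difference (hypothetical function names, not from either branch):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L. Each y[i] depends on all of
    y[:i], so the outer loop is inherently sequential; this dependency is
    what makes the LU solve hard to parallelize."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def inverse_apply(R_inv, b):
    """With the inverse in hand, the solve is a plain matrix-vector
    product: every row dot product is independent of the others."""
    return R_inv @ b
```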

I've made several attempts to parallelize the forward/backward substitutions of the LU decomposition (these are in the feature/parallel_back_sub branch), but the approaches suffer performance degradation either from synchronization (MPI_Barrier/MPI_Win_fence calls) or from lack of locality (cache misses). On the other hand, parallelizing a matrix multiplication is rather trivial, which I've done in the feature/matrix_inverse branch; in this case, the cost of the linear solve ceases to be an issue.
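In the distributed case the idea looks something like the following mpi4py sketch (purely illustrative, and not the actual feature/matrix_inverse implementation): each rank owns a block of rows of the precomputed inverse, does its share of the matrix-vector product independently, and only one collective is needed at the end.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1024                       # assume n % size == 0 for simplicity
rows = n // size

# Each rank stores only its block of rows of the precomputed inverse.
R_inv_local = np.random.default_rng(rank).standard_normal((rows, n))

# Broadcast the right-hand side to all ranks.
b = np.empty(n)
if rank == 0:
    b = np.random.default_rng(99).standard_normal(n)
comm.Bcast(b, root=0)

# Independent local work: no synchronization, good locality.
x_local = R_inv_local @ b

# One collective to assemble the full solution everywhere.
x = np.empty(n)
comm.Allgather(x_local, x)
```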

I suppose for many use cases, the LU decomposition is good enough, but as we push towards larger problem sizes we'll need to consider the inversion option. At the very least, we may want to somehow inform users that this option exists.

EDIT: I imagine David Dickinson has some advice on this front from his experience with GS2.

DenSto · Jun 27 '22 13:06