Matrices using row stride instead of rows array
I think I would be happier if all our dense matrix type always stored rows in order, with a slot for the row stride (e.g. in bytes) instead of having an array of pointers to rows.
Pros:
- Less space, fewer allocations.
- Column vector matrices are as efficient as row vector matrices.
- Can apply vec methods to the whole matrix when the rows are contiguous.
- Creating window matrices is O(1) and requires no allocation.
- Fewer memory accesses, so many basic operations should be faster.
- Compatibility with BLAS type interfaces.
Cons
- Slower row swaps. Granted, we need these in things like Gaussian elimination and LLL, but are there any places where swapping contents of rows would actually have a measurable performance impact? Typically this will be dwarfed by arithmetic.
- Can't create row-permuting window matrices, but do we actually use this anywhere?
I totally agree.
That is a lot of pros and not many cons. Are there cases where we would expect substantial speedups by moving to the current design from the suggested one? This is unclear to me, but anyway any speedup or simplification is good to take.
Just to be sure I follow, for nmod_mat this would simply mean dropping the mat->rows array, and adapting the rest so that everything still works, right?
Concerning the row swaps, I suppose that if some algorithm needs lots of them to the point that it is impacting performance, it could instead create its own pointers to rows, or simply an array representing the current "virtual row permutation", and manipulate the matrix through this (performing the permutation only when necessary, typically at the end or before some recursive call), similarly to how it is done currently?
Currently trying this out on https://github.com/fredrik-johansson/flint/tree/matrix
A few months late, here is feedback on this and the associated (long merged) PR. I tested this against an implementation of "mbasis" (a minimal approximant basis algorithm which is based on linear algebra and whose typical implementations involve a lot of matrix permutations, like at this link ).
On a few examples, created to maximize the time spent in permutations, the new version without stride went from about the same time to at most 15% slower. On the other hand, on other examples I observed some speedups (up to 35%, although on non-typical examples). Besides it is worth noting that when writing this mbasis code I specifically tried to exploit the fact that permutations were particularly cheap; so maybe by reexamining the code, the occasional slight slowdown could disappear.
In short: thumbs up for this change.