RFC: add support for LU factorization in the linalg extension
It seems that many libraries that are candidates to implement the Array API namespace already implement LU factorization (with variations in API, and with the notable exception of numpy).
However, LU is not part of the list of linear algebra operations in the current state of the spec:
- https://data-apis.org/array-api/latest/extensions/linear_algebra_functions.html#objects-in-api
Are there any plans to consider it for inclusion?
Thanks for asking @ogrisel. I had a look through the initial issues which considered the various linalg APIs, and LU decomposition was not considered at all there. The main reason, I think, is that the overview started from what is present in NumPy and then looked for matching APIs in other libraries.
I think it's an uphill battle for now. It would require adding it to numpy.linalg, moving it in other libraries with numpy-matching APIs (e.g., https://docs.cupy.dev/en/stable/reference/scipy_linalg.html#decompositions is in the wrong place), and then aligning on APIs also with PyTorch & co. Finally, there's lu but also lu_solve and lu_factor - would it be just one of those, or 2/3?
It seems to me that LU decomposition is important enough that it's worth working on. So we could figure out what the optimal API for it would be, and then add it to array-api-compat so it can be used in scikit-learn and SciPy. That can be done on pretty short notice I think. From there to actually standardizing it would take quite a long time, I suspect (but nothing is really blocked on not having that done).
The signatures to consider:
- SciPy/cupyx.scipy/jax.scipy: lu(a, permute_l=False, overwrite_a=False, check_finite=True)
- PyTorch: torch.linalg.lu(A, *, pivot=True, out=None)
The overwrite_a, check_finite and out keywords should all be out of scope for the standard.
The permute_l/pivot keywords do seem relevant to include, though they control the return values in different ways. SciPy's permute_l returns 3 arrays if False and 2 arrays if True. That breaks a key design rule for the array API standard (no polymorphic APIs), so we can't do that. PyTorch's pivot keyword is okay: the function always returns a named tuple (P, L, U), and leaves P as an empty array for the non-default pivot=False.
PyTorch defaults to partial pivoting, and the keyword allows turning pivoting off. An LU decomposition with full pivoting is also a thing mathematically, but it does not seem to be implemented in any of these libraries. JAX also has jax.lax.linalg.lu, which only does partial pivoting.
So it seems like lu(x, /) -> namedtuple(array, array, array), defaulting to partial pivoting, is the minimum needed; the question is whether the other pivoting mode(s) are needed.
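For concreteness, a minimal sketch of that signature on top of SciPy (the namedtuple name and field names here are my own assumption, nothing settled by any spec):

```python
from collections import namedtuple

import numpy as np
from scipy import linalg

# Hypothetical result type; field names are not settled anywhere.
LUResult = namedtuple("LUResult", ["P", "L", "U"])

def lu(x, /):
    # Partial pivoting only: the one flavor every candidate library supports.
    P, L, U = linalg.lu(x)
    return LUResult(P=P, L=L, U=U)

x = np.random.randn(5, 3)
res = lu(x)
assert np.allclose(res.P @ res.L @ res.U, x)
```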
dask.array.linalg.lu has no keywords at all, and no info in the docstring about what is implemented. From the tests it's clear that it matches the SciPy default (permute_l=False).
For PyTorch, the non-default flavor is only implemented on GPU:
>>> A = torch.randn(3, 2)
... P, L, U = torch.linalg.lu(A)
>>> A = torch.randn(3, 2)
... P, L, U = torch.linalg.lu(A, pivot=False)
Traceback (most recent call last):
Cell In[6], line 2
P, L, U = torch.linalg.lu(A, pivot=False)
RuntimeError: linalg.lu_factor: LU without pivoting is not implemented on the CPU
Its docstring also notes: The LU decomposition without pivoting may not exist if any of the principal minors of A is singular.
tl;dr maybe the best way to go is to only implement partial pivoting?
Maybe we can start with a function with no keywords that always returns (P, L, U) (that is, only implement SciPy's permute_l=False and PyTorch's pivot=True) and leave it up to the consumer to compute the P @ L product if needed.
On the other hand, I think it would be good to have an option to do the P @ L product automatically and avoid allocating P. Should the array API expose two functions: xp.linalg.lu, which outputs a 3-tuple (P, L, U), and a second function xp.linalg.permuted_lu, which precomputes the P @ L product and always outputs a 2-tuple (P @ L, U)?
Its docstring also notes: The LU decomposition without pivoting may not exist if any of the principal minors of A is singular.
Also note, from PyTorch's doc:
The LU decomposition is almost never unique, as often there are different permutation matrices that can yield different LU decompositions. As such, different platforms, like SciPy, or inputs on different devices, may produce different valid decompositions.
Such a disclaimer should probably be mentioned in the Array API spec.
Note that scipy.linalg.lu calls:
flu, = get_flinalg_funcs(('lu',), (a1,))
p, l, u, info = flu(a1, permute_l=permute_l, overwrite_a=overwrite_a)
and flu is therefore not polymorphic internally, but p is a 1x1 array with a single 0 value when permute_l is True.
@ogrisel I opened gh-630 for the default (partial pivoting) case that seems supportable by all libraries.
On the other hand, I think it would be good to have an option to do the P @ L product automatically and avoid allocating P. Should the array API expose two functions: xp.linalg.lu, which outputs a 3-tuple (P, L, U), and a second function xp.linalg.permuted_lu, which precomputes the P @ L product and always outputs a 2-tuple (P @ L, U)?
Perhaps. The alternative of having an empty P like PyTorch does may work, but it's not ideal. JAX would have to preallocate a full-size array whenever the keyword is used and is not a literal True/False.
Given that this use case seems more niche, it's not supported by Dask or by PyTorch on CPU, and you don't need it now in scikit-learn, waiting for a stronger need seems like the way to go here though.
We do use the "permute_l=True" case in scikit-learn.
It would be easy to provide a fallback implementation that uses an extra temporary allocation + mm product for libraries that do not natively support scipy's permute_l=True.
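A minimal sketch of such a fallback, with permuted_lu as the hypothetical name floated earlier in the thread:

```python
import numpy as np
from scipy import linalg

def permuted_lu(x):
    # Factor with partial pivoting, then fold P into L: one extra temporary
    # allocation (P) plus one matmul, as described above.
    P, L, U = linalg.lu(x)
    return P @ L, U

x = np.random.randn(20, 10)
PL, U = permuted_lu(x)
assert np.allclose(PL @ U, x)
```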
But it's not clear if PyTorch's pivot=False is equivalent to SciPy's permute_l=True or doing something different.
We do use the "permute_l=True" case in scikit-learn.
Oops, yes, I didn't read carefully.
But it's not clear if PyTorch's pivot=False is equivalent to SciPy's permute_l=True or doing something different.
It looks like it's doing the exact same thing:
>>> import torch
>>> import numpy as np
>>> from scipy import linalg
>>> t = torch.rand(20, 10, device='cuda') # must be 'cuda', because `pivot=False` is not supported on CPU
>>> Pt, Lt, Ut = torch.linalg.lu(t.to('cuda'), pivot=False)
>>> Lt.shape
torch.Size([20, 10])
>>> Ut.shape
torch.Size([10, 10])
>>> Pt
tensor([], device='cuda:0')
>>> Ln, Un = linalg.lu(t.to('cpu').numpy(), permute_l=True)
>>> Ln.shape
(20, 10)
>>> Un.shape
(10, 10)
>>> np.testing.assert_allclose(Lt.to('cpu'), Ln)
...
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatched elements: 190 / 200 (95%)
Max absolute difference: 56.958973
Max relative difference: 181.88818
>>> (Lt @ Ut - t).max()
tensor(2.1011e-06, device='cuda:0')
>>> (Ln @ Un - t.to('cpu').numpy()).max()
1.1920929e-07
So both give valid decompositions with the same shapes for L and U, and P either empty or not returned; numerical values are very different but that's expected.
One other thing to point out, the non-pivoting case does result in a decomposition with larger errors for PyTorch:
>>> Pt, Lt, Ut = torch.linalg.lu(t)
>>> (Pt @ Lt @ Ut - t).abs().max()
tensor(1.4901e-07, device='cuda:0')
>>> Pt, Lt, Ut = torch.linalg.lu(t, pivot=False)
>>> (Lt @ Ut - t).abs().max()
tensor(2.1011e-06, device='cuda:0')
>>> Pn, Ln, Un = linalg.lu(x)
>>> np.abs((Pn @ Ln @ Un - x)).max()
1.1920929e-07
>>> PLn, Un = linalg.lu(x, permute_l=True)
>>> np.abs((PLn @ Un - x)).max()
1.1920929e-07
SciPy's pivoting decomposition seems to be a lot faster for small arrays, has similar performance for large square or wide arrays, and is several times slower for tall ones:
>>> %timeit linalg.lu(x) # shape (20, 10)
8.68 µs ± 38.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
21.7 µs ± 80.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit linalg.lu(x) # shape (400, 300)
951 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
832 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit Pn @ Ln
190 µs ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> x = np.random.randn(1000, 1000)
>>> %timeit linalg.lu(x)
8.49 ms ± 9.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
7.94 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> x = np.random.randn(1000, 50) # tall array
>>> %timeit linalg.lu(x)
734 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
250 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> x = np.random.randn(50, 1000) # wide array
>>> %timeit linalg.lu(x)
220 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit linalg.lu(x, permute_l=True)
222 µs ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Given that scikit-learn does use permute_l=True and I suspect the tall array case is quite common for ML applications, there does seem to be a case for providing both flavors.
For completeness, some PyTorch GPU timings:
>>> t = torch.rand(1000, 1000, device='cuda') # square tensor
>>> %timeit torch.linalg.lu(t)
7.11 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit torch.linalg.lu(t, pivot=False)
1.28 ms ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> t = torch.rand(1000, 50, device='cuda') # tall tensor
>>> %timeit torch.linalg.lu(t)
2.15 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit torch.linalg.lu(t, pivot=False)
343 µs ± 6.58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> t = torch.rand(50, 1000, device='cuda') # wide tensor
>>> %timeit torch.linalg.lu(t)
1.21 ms ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit torch.linalg.lu(t, pivot=False)
488 µs ± 800 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The performance gain is even larger there. So we have a default algorithm that's slower but stable, and an opt-in to a significantly faster algorithm.
@lezcano @nikitaved it'd be nice to get your opinion on this one. I can't find a PyTorch feature request for LU decomposition with pivot=False on CPU, despite it potentially being much faster. Is there a reason why it's only implemented on GPU?
@rgommers, if there is something for the GPU which is missing for the CPU in PyTorch, that is most likely because GPU performance/functionality has much higher value, especially considering the usage in DL applications. However, sometimes PyTorch is too strict and follows suit, like with SVD, iirc: PyTorch produces the full SVD by default, even though it is not required in most cases, and that hurts GPU performance overall... I, personally, would not mind having pivot=False for the CPU in PyTorch. We could create a feature request there, but I doubt the core team will be interested. It might be a very nice on-boarding task though...
pivot=False is often quite risky, as the class of matrices for which the decomposition exists has a not-so-nice structure (all the principal minors must be away from zero). For example, it does not exist for a matrix A with A[0,0] = 0. With this limitation, there are very few families of matrices for which this property holds. One of the few families for which this condition holds is positive definite matrices (by Sylvester's criterion), but then the LU decomposition is just good ol' Cholesky, for which we have a specific algorithm that's 2x faster.
Partial pivoting is often a good compromise between stability and speed when compared to, say, full pivoting or other more general decompositions. For this reason, I don't think that there are many practical cases where you would want to have a regular LU decomposition.
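To illustrate the breakdown: a bare-bones Doolittle elimination without pivoting (purely illustrative code, not a library routine) fails exactly on the A[0,0] = 0 example above:

```python
import numpy as np

def lu_nopivot(A):
    # Doolittle elimination with no row exchanges; it breaks down as soon
    # as a leading principal minor is singular (zero pivot).
    U = np.array(A, dtype=float)
    n = U.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        if U[k, k] == 0.0:
            raise ZeroDivisionError("zero pivot: no LU without pivoting exists")
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
    return L, np.triu(U)

lu_nopivot(np.array([[0.0, 1.0], [1.0, 0.0]]))  # raises: A[0, 0] == 0
```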
For example, it does not exist for a matrix A with A[0,0] = 0. With this limitation, there are very few families of matrices for which this property holds.
I would not say few; there are plenty of matrices that satisfy such conditions, like full-rank matrices with dominant diagonals. And in most cases diagonals are pre-conditioned anyway. Besides, pivot=True is the default, so I would assume the user is aware of the structure of the matrices at hand when pivot=False is chosen.
Partial pivoting is often a good compromise between stability and speed when compared to, say, full pivoting or other more general decompositions. For this reason, I don't think that there are many practical cases where you would want to have a regular LU decomposition.
There are cases where a single decomposition of a non-symmetric matrix leads to solving linear systems whose right-hand sides come in a stream, like in NNs that solve a system during training; then (or meanwhile) the corresponding weight matrix could be decomposed to enable efficient inference.
I don't think the diagonal case you mentioned is correct, but it's true that there is at least one family of matrices that I know of for which LU decompositions exist, which is column diagonally-dominant matrices.
At any rate, in most cases you do want partial pivoting due to stability reasons, so yeah. This could be implemented in PyTorch, so it shouldn't stop the array API from standardising it. I just think it shouldn't be the default.
Also, in terms of standardisation, I prefer the API we have in PyTorch for lu_solve over the one exposed in SciPy. In particular, we don't expose the trans parameter, but simply act on the strides of the input tensor to choose the best value for this parameter. We also act on the conj bit that PyTorch implements to solve A^H X = B, but I don't know if NumPy implements that.
I don't think the diagonal case you mentioned is correct, but it's true that there is at least one family of matrices that I know of for which LU decompositions exist, which is column diagonally-dominant matrices.
Agree. Fixed and changed to diagonal dominance which I meant in the first place :)
But on the topic of flexibility and performance, my opinion is that having control over pivoting is a plus. In PyTorch we even have these _ex methods that do not report potential issues, all for the sake of performance and at the cost of correctness. Performance is quite critical in modern data-intensive applications and seems very much desired... I am not 100% sure that everything has to be fool-proof. So, I believe it is great that CUDA support for LU allows pivot=False, and we can do the same for the CPU if the need arises, yes. Especially if the standard requires it.
And I agree that having trans could be redundant and that the API in PyTorch is indeed more intuitive and natural. I wish it were possible to absorb pivots just as seamlessly.
But it's not clear if PyTorch's pivot=False is equivalent to SciPy's permute_l=True or doing something different.
It looks like it's doing the exact same thing.
I am not sure, but I cannot quickly check since I do not have access to a CUDA GPU at the moment:
By reading the docstring of torch.linalg.lu, I am under the impression that the returned L is always lower triangular with ones on the diagonal, whether or not pivot=False, while this is not the case for scipy.linalg.lu: permute_l=True returns the product P @ L, which is in general not lower triangular.
EDIT: The fortran code for the scipy LU implementation is here:
- https://github.com/scipy/scipy/blob/19d7a9461e24ce110077f9b075ec745fd48bb037/scipy/linalg/src/lu.f
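The SciPy half of that impression is easy to check without a GPU (a quick illustrative snippet):

```python
import numpy as np
from scipy import linalg

x = np.random.randn(5, 3)
PL, U = linalg.lu(x, permute_l=True)
# PL has permuted rows, so it is generally not lower triangular:
print(np.allclose(PL, np.tril(PL)))  # usually False
```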
Also, in the case of randomized_range_finder used internally by scikit-learn's PCA(svd_solver="randomized"), I confirm that we are interested in applying LU to tall matrices for which the permutation matrix P would not fit in memory; hence implicitly precomputing the P @ L product is necessary (otherwise one would get a crash instead of a slowdown).
If you have a tall matrix, what you often want to do is to compute the PLU of the transposed matrix and use it to solve the given system.
Even more, you very rarely want to use linalg.lu. That's a function that's there just for illustrative purposes most of the time. What you want to use is linalg.lu_factor and then linalg.lu_solve. lu_factor returns a permutation vector that represents the full permutation matrix that may then be used in lu_solve. You can also implement your own lu_solve in terms of solve_triangular and scatter. We use this sometimes in PyTorch:
https://github.com/pytorch/pytorch/blob/4f2c007a1b5170c2aa0d47e388ff9e07c7a7d354/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp#L2425-L2463
Also, thank you for digging this up https://github.com/data-apis/array-api/issues/627#issuecomment-1556881760. It looks to me that we don't want to include the permute_l function, as it's just not useful. The matrix P@L does not have any structure, and it may not be used to solve a system in any way.
@lezcano I agree. I think the matrix P should not be returned as a matrix, but rather as a vector encoding the permutation. I would suggest that vector be the image of range(min(M, N)) under the permutation. That way we could have (L @ U)[p[i], j] == X[i, j]. An empty P array would be understood as the identity permutation.
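A quick numerical check of that convention (square case only, recovering p from SciPy's explicit P; purely illustrative):

```python
import numpy as np
from scipy import linalg

X = np.random.randn(4, 4)
P, L, U = linalg.lu(X)          # X == P @ L @ U
p = P.argmax(axis=1)            # permutation vector recovered from P
assert np.allclose((L @ U)[p], X)   # i.e. (L @ U)[p[i], j] == X[i, j]
```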
Cc @ilayn who is currently making changes to scipy.linalg.lu
@oleksandr-pavlyk I was referring to the output of linalg.lu_factor which returns the permutation encoded as a vector, but not with the usual encoding. linalg.lu_factor is just an interface into LAPACK's getrf, so it returns the pivots in the same format as LAPACK does.
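For illustration, one way (the helper name is made up) to replay those LAPACK-style pivots into the usual permutation-vector encoding:

```python
import numpy as np
from scipy.linalg import lu_factor

def pivots_to_permutation(piv):
    # Replay the LAPACK-style row interchanges ("row i was swapped with
    # row piv[i]") on an identity permutation.
    perm = np.arange(len(piv))
    for i, j in enumerate(piv):
        perm[[i, j]] = perm[[j, i]]
    return perm

A = np.random.randn(4, 4)
lu, piv = lu_factor(A)
L = np.tril(lu, k=-1) + np.eye(4)   # unit diagonal is implicit in lu
U = np.triu(lu)
assert np.allclose(A[pivots_to_permutation(piv)], L @ U)
```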
For these functions, a minimal set of functions from which all the others can be implemented is given by:
- linalg.lu_factor, which would just be modelled after scipy.linalg.lu_factor (which is LAPACK's getrf)
- Something along the lines of linalg.qr_multiply, which could be called linalg.lu_multiply (name TBD), with signature lu_multiply(pivots: Array, B: Array, trans: bool = False) that, given the pivots in LAPACK's format and a matrix B, computes P @ B if not trans else P^T @ B
With those two, you can implement lu_solve in terms of them and solve_triangular. You can also implement linalg.lu in terms of these + tril/triu. IMO, I would be +1 to providing linalg.lu_solve even if it's redundant, and I'd suggest not to include linalg.lu.
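A sketch of that construction in SciPy terms (my_lu_solve is a stand-in name; the swap loop below plays the role of the hypothetical lu_multiply with trans=True, i.e. P^T @ b):

```python
import numpy as np
from scipy.linalg import lu_factor, solve_triangular

def my_lu_solve(lu_and_piv, b):
    lu, piv = lu_and_piv
    b = np.array(b, dtype=float)
    # Apply the LAPACK-style row interchanges to b, giving P^T @ b.
    for i, j in enumerate(piv):
        b[[i, j]] = b[[j, i]]
    y = solve_triangular(lu, b, lower=True, unit_diagonal=True)  # L y = P^T b
    return solve_triangular(lu, y)                               # U x = y

A = np.random.randn(4, 4)
b = np.random.randn(4)
assert np.allclose(A @ my_lu_solve(lu_factor(A), b), b)
```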
It looks to me that we don't want to include the permute_l function, as it's just not useful. The matrix P@L does not have any structure, and it may not be used to solve a system in any way.
In the use case in scikit-learn (https://github.com/data-apis/array-api/issues/627#issuecomment-1556898302) we use P @ L as a cheap normalization step (as an alternative to QR) in the randomized SVD solver after each power iteration. This happens at step 2 of the algorithm presented on page 9 of https://arxiv.org/abs/0909.4061.
While the original paper does not use this LU-based normalization trick, scikit-learn took inspiration from @tygert and @ajtulloch's alternative implementation published in https://github.com/facebookarchive/fbpca. See section 5 of:
http://tygert.com/implement.pdf
We could subsequently confirm with our own benchmarks that the LU-based renormalization improves the speed-accuracy trade-off of our randomized SVD solver.
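Roughly, the normalization step looks like this (a condensed sketch with arbitrary sizes, not scikit-learn's actual randomized_range_finder):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((100_000, 50))   # tall data matrix
Q = rng.standard_normal((50, 10))        # random projection, 10 components

for _ in range(4):  # power iterations, LU-normalized at each step
    # permute_l=True sidesteps materializing the (100_000, 100_000) P.
    Q, _ = linalg.lu(A @ Q, permute_l=True)
    Q, _ = linalg.lu(A.T @ Q, permute_l=True)

# A final QR pass then orthonormalizes the range basis.
Q, _ = np.linalg.qr(A @ Q)
```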
I don't see how we could implement this pattern with the lu_factor / lu_solve API.
The SciPy signature is going to have another keyword here denoted by p_indices for lack of a better word to return the permutation as a 1D array. See https://github.com/scipy/scipy/pull/18358
lu(a, permute_l=False, overwrite_a=False, check_finite=True, p_indices=False)
and it now supports ndarrays of arbitrary size (subject to memory constraints). No Fortran code is going to be used, and the existing infra is being deprecated starting 1.12.
The overwrite_a and check_finite stuff are LAPACK-isms that do not need any standardization, and at some point I'll propose a deprecation by moving things to native Cython code.
lu_solve and lu_factor are practically historical hacks for the lack of a better API, so I would really discourage any standardization around them. Also, LU is the most essential starting decomposition, so it should be present in any linalg API.
Instead, LU solve and the other *trs routines (sytrs, potrs, trtrs) should all go into the polyalgorithm of linalg.solve, as is the case for every backslash operator in scientific computation software.
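As a toy sketch of what such a polyalgorithm could look like (the structure checks and dispatch choices are illustrative only, not SciPy's actual logic):

```python
import numpy as np
from scipy import linalg

def solve_poly(A, b):
    # Triangular? Use a triangular solve (trtrs).
    if np.allclose(A, np.triu(A)):
        return linalg.solve_triangular(A, b)
    if np.allclose(A, np.tril(A)):
        return linalg.solve_triangular(A, b, lower=True)
    # Symmetric? Try Cholesky (potrf/potrs), fall back to sytrf/sytrs.
    if np.allclose(A, A.T):
        try:
            return linalg.cho_solve(linalg.cho_factor(A), b)
        except np.linalg.LinAlgError:
            return linalg.solve(A, b, assume_a="sym")
    # General case: LU with partial pivoting (getrf/getrs).
    return linalg.solve(A, b)
```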
I don't see how we could implement this pattern with the lu_factor / lu_solve API.
In the API proposed in https://github.com/data-apis/array-api/issues/627#issuecomment-1557278813, there is a specific function to compute P @ B for an arbitrary B matrix, so that's how you would compute P @ L.
On a different note, that's a rather interesting trick for the implementation of that algorithm, which sacrifices a bit of stability, but not too much :) Now, I do not know a priori why you would need P @ L rather than just L, but that's a story for another day.
In the API proposed in https://github.com/data-apis/array-api/issues/627#issuecomment-1557278813, there is a specific function to compute P @ B for an arbitrary B matrix, so that's how you would compute P @ L.
I suspect that this would cause quite a bit of overhead compared to the current API but maybe it's negligible for large enough data.
Now, I do not know a priori why you would need P @ L rather than just L, but that's a story for another day.
It's quite possible that the LU re-normalization trick was written this way (with permute_l=True) to avoid attempting the allocation of P, which would crash for tall matrices when using the scipy.linalg.lu API, but that using the unpermuted L would have worked similarly. I would need to confirm that hypothesis with some numerical experiments.
Now, I do not know a priori why you would need P @ L rather than just L, but that's a story for another day.
It's quite possible that the LU re-normalization trick was written this way (with permute_l=True) to avoid attempting the allocation of P, which would crash for tall matrices when using the scipy.linalg.lu API, but that using the unpermuted L would have worked similarly. I would need to confirm that hypothesis with some numerical experiments.
Note that the matrix being re-normalized is often highly rank-deficient. That said, the matrix is also randomized, although the degree of randomization varies. Omitting all pivoting with a rank-deficient matrix for which the randomization may be less than perfect sounds potentially numerically unstable to me. But allocating space for a full permutation matrix (rather than just for the permutation defining it) sounds unnecessarily wasteful to me, yup. Maintaining the pivoting yet without having to form all entries of the full permutation matrix would make sense, perhaps?