Add `Op`s for `solve_discrete_lyapunov` and `solve_continuous_lyapunov`
Following our discussions in #1015 and #1011, this PR adds Ops to aesara.tensor.slinalg that wrap scipy.linalg.solve_discrete_lyapunov and scipy.linalg.solve_continuous_lyapunov, and computes the reverse-mode gradients for each.
One note: I got an error from mypy when running the pre-commit hooks:
aesara\link\c\cmodule.py:2439: error: Unused "type: ignore" comment
But this is unrelated to any changes I made, so I wasn't sure how to handle it.
While the direct solver can be written elegantly using only Aesara Ops, I still think there is value in handing the forward computation to SciPy. As noted in the SciPy documentation, the direct solver scales very poorly with the size of the matrices. Even at N=50, the bilinear solver is two orders of magnitude faster than the direct solver (<1 ms vs. 180 ms). The memory footprint is also much lower, because a Kronecker product is not required. Here is a graph comparing the performance of the Op from this PR, using the bilinear solver, to the pure-Aesara direct solver.
Sizes larger than N=250 could not be allocated in memory for the direct solution. In the application I have in mind (Kalman filtering), this routine would need to be called once per logp evaluation, so speed is important.
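For context, here is a minimal NumPy sketch of what the direct approach amounts to (the helper name is mine, purely for illustration): the discrete equation $AXA^H - X + Q = 0$ is vectorized into a Kronecker-structured linear system, which is why its memory use grows so quickly with N.

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def direct_solve_discrete_lyapunov(A, Q):
    # Vectorize A X A^H - X + Q = 0 into (I - kron(A, conj(A))) vec(X) = vec(Q),
    # using NumPy's row-major flatten/reshape convention
    N = A.shape[0]
    lhs = np.eye(N * N) - np.kron(A, A.conj())
    x = np.linalg.solve(lhs, Q.flatten())
    return x.reshape(N, N)

rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(5, 5))
Q = rng.normal(size=(5, 5))
Q = Q + Q.T
np.allclose(direct_solve_discrete_lyapunov(A, Q), solve_discrete_lyapunov(A, Q))
# True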
I realize this is an unfair comparison, and that the real question is whether the bilinear solver could be directly implemented in Aesara. It would first require an implementation of the Schur decomposition (the imaginary part is discarded, so this is not an issue), then an implementation of a Sylvester equation solver, which would in turn require an efficient solution routine like trsyl; otherwise we run into the direct-solver problem again. Alternatively, we could decide that having access to trsyl is worthwhile, wrap the SciPy Sylvester equation solver, and call that from a semi-pure Aesara implementation of solve_lyapunov, but this strikes me as arbitrary, especially if the Schur decomposition ended up as a SciPy wrapper as well.
> I realize this is an unfair comparison, and that the real question is whether the bilinear solver could be directly implemented in Aesara.

There's no need to justify the inclusion of a relevant SciPy feature, and, yes, the question is whether or not the other estimation approach that is supported by SciPy can be implemented with existing Aesara Ops, like the direct approach.
If need be, a custom Op can always be used for only the other approach, but it's more important that we adequately assess an Aesara-based implementation and determine what's missing, if anything.
For reference, the other discrete approach (i.e. "bilinear") appears to be here, and, from a quick glance, it looks to be well covered by existing Ops.
The relevant lines I see as being not covered are here, inside the solve_continuous_lyapunov function:
# Compute the Schur decomposition form of a
r, u = schur(a, output='real')
# Construct f = u'*q*u
f = u.conj().T.dot(q.dot(u))
# Call the Sylvester equation solver
trsyl = get_lapack_funcs('trsyl', (r, f))
> It would first require an implementation of the Schur decomposition (the imaginary part is discarded, so this is not an issue), then an implementation of a Sylvester equation solver, which would in turn require an efficient solution routine like trsyl; otherwise we run into the direct-solver problem again. Alternatively, we could decide that having access to trsyl is worthwhile, wrap the SciPy Sylvester equation solver, and call that from a semi-pure Aesara implementation of solve_lyapunov, but this strikes me as arbitrary, especially if the Schur decomposition ended up as a SciPy wrapper as well.
Yes, a Schur decomposition and a trsyl Op are needed, and why would adding them be "arbitrary"?
"Arbitrary" here refers to the criterion for writing an Op wrapper around a given scipy function, but the word was chosen more out of my ignorance of the criterion than any capriciousness (or lack thereof) in the criterion itself. What I am gleaning is that the ideal criteria for a custom scipy Op is, "as close to LAPACK as possible"? One could get more atomic than that, but I doubt it would be productive.
I am also struggling with the merits and demerits of having analytic gradient expressions. My naive perspective is that, even if we had implementations of Schur and trsyl, it would still be better to manually implement the Lyapunov Ops, because we have an analytic expression for the reverse-mode gradients. We relieve the system of the need to compute these, and side-step any potential issues arising from approximation.
"Arbitrary" here refers to the criterion for writing an Op wrapper around a given scipy function, but the word was chosen more out of my ignorance of the criterion than any capriciousness (or lack thereof) in the criterion itself. What I am gleaning is that the ideal criteria for a custom scipy Op is, "as close to LAPACK as possible"? One could get more atomic than that, but I doubt it would be productive.
Ah, yeah, the "depth" at which something should be implemented is always measured relative to its complexity and ancillary benefits. As you've noticed, this library's current "depth" is somewhere around the BLAS/LAPACK level. As a result, the complexity of implementing something at that level is relatively low, and there are multiple examples illustrating how to implement and test such Ops (e.g. see aesara.tensor.blas). These examples make good use of the old C backend, so they also serve as good templates for making performant Ops.
Those new Ops can also be implemented in exactly the same way as the current ones in this PR (e.g. without C, Numba, and/or JAX implementations). We can always return to them later and add other backend implementations, if need be. Likewise, their gradients shouldn't be any more complicated than the current ones.
Work at this "depth" provides two important ancillary benefits that are exclusive to this approach:
- two new low-level Op implementations that can be used to implement other "compound" functions, and
- access to optimizations on the intermediate Ops used by the compound implementations.
Our ability to reason about expressions and manipulate them in an automated fashion is what allows many of the performance and stability optimizations that set this project apart from plain NumPy and SciPy, as well as other tensor libraries. Redundant, high-level Ops that combine the functionality of existing Ops effectively hide information from Aesara that could be used to perform optimizations.
Consider an Op that is equivalent to $f(x) = \exp(x^2)$. If a graph is compiled that represents the expression $\log \exp(x^2)$, then Aesara is able to "analytically" reduce that expression to $x^2$ using one simple rewrite. If a computationally equivalent graph representing $\log f(x)$ is compiled, then an unnecessary exponential function evaluation is required.
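A quick sketch of that point (the rewrite is performed by Aesara's graph optimization passes; this just checks the end result):

import numpy as np
import aesara
import aesara.tensor as at

x = at.vector("x")
y = at.log(at.exp(x ** 2))

fn = aesara.function([x], y)
# The compiled graph should no longer contain the exp/log pair, leaving
# (essentially) just the square of `x`.
aesara.printing.debugprint(fn)
np.allclose(fn(np.array([1.0, 2.0, 3.0])), np.array([1.0, 4.0, 9.0]))
# True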
One could construct new rewrites explicitly for the redundant Op, but those rewrites would be redundant themselves and only add to the development and testing effort. The same goes for transpilation support (e.g. C, Numba, JAX).
In this situation, redundancy begets redundancy, so, unless the benefits of introducing redundancy in a particular instance are very clear, we must avoid it altogether.
> I am also struggling with the merits and demerits of having analytic gradient expressions. My naive perspective is that, even if we had implementations of Schur and trsyl, it would still be better to manually implement the Lyapunov Ops, because we have an analytic expression for the reverse-mode gradients. We relieve the system of the need to compute these, and side-step any potential issues arising from approximation.
First, we always have "analytic" expressions for the gradients. Numerical approximations are not produced or used by Aesara.
Second, the Op.grad implementations you've already produced necessarily contain the same gradient "information" that would go into Schur and trsyl Op.grad implementations. More specifically, since trsyl is a solver for the Sylvester equations, you already have its Op.grad implementation in the paper you referenced when producing the implementations for the Lyapunov Ops.
The Op.grad implementation for the Schur decomposition is perhaps the only open question.
Thanks for laying it out so clearly. I'm clearly not getting it through my head that Aesara straddles the line between SymPy and NumPy. I appreciate your patience as I learn all this.
I'm happy to work on trsyl. As you note, the reverse-mode equations are already worked out in the same paper I'm already working from. I can submit it as a separate PR. I also have an Op for solve_discrete_are worked out, but this uses scipy.linalg.ordqz, scipy.linalg.qr, and scipy.linalg.lu, which would all need to be implemented to avoid just directly wrapping scipy.linalg.solve_discrete_are. QR decomposition is in tensor.nlinalg, but it doesn't appear to have gradients implemented.
I spent this morning doing some research on the Schur decomposition, and it seems it would be non-trivial to implement. I couldn't find an implementation in TF, Torch, or MXNet. I dug around on Google Scholar and found implementations of the gradients of the SVD, QR, and LU decompositions (here and here), but not Schur or QZ.
If anyone has a reference, I'm happy to work on figuring an implementation out. But, as I'm sure you can tell from repeated interactions with me, I'm already extremely out of my depth here. Trying to work out the gradients myself from first principles is not likely to happen.
> I can submit it as a separate PR.
If we can't get a gradient for the Schur decomposition in place, then a separate PR for trsyl would be the next best thing.
> I also have an Op for solve_discrete_are worked out, but this uses scipy.linalg.ordqz, scipy.linalg.qr, and scipy.linalg.lu, which would all need to be implemented to avoid just directly wrapping scipy.linalg.solve_discrete_are. QR decomposition is in tensor.nlinalg, but it doesn't appear to have gradients implemented.
Same with those other decompositions (e.g. QR, LU) with known gradients; having Ops and/or Op.grad implementations for those would be a big help.
In the end, we really only need to do this sort of due diligence to determine exactly why something can/can't reasonably be done at a lower Op level. Knowing specifically that the roadblock is a derivative for the Schur decomposition is incredibly useful, because we can, among other things, create an issue for it and explicitly track its relevance to other Op implementations and, ideally, gather information and make progress on that issue over time.
> I spent this morning doing some research on the Schur decomposition, and it seems it would be non-trivial to implement. I couldn't find an implementation in TF, Torch, or MXNet. I dug around on Google Scholar and found implementations of the gradients of the SVD, QR, and LU decompositions (here and here), but not Schur or QZ.
> If anyone has a reference, I'm happy to work on figuring an implementation out. But, as I'm sure you can tell from repeated interactions with me, I'm already extremely out of my depth here. Trying to work out the gradients myself from first principles is not likely to happen.
It wouldn't surprise me if no one has published a solution to this one, but my first thought is that the approach would be very similar to the one for QR, since the two decompositions are related. If I recall, a common QR algorithm essentially computes the Schur decomposition, but there might be some subtleties involving real/imaginary results.
@brandonwillard But if we can't easily get Schur, we can't get Lyapunov that way, no? If that's the case, it sounds like this PR would be our next best bet.
> @brandonwillard But if we can't easily get Schur, we can't get Lyapunov that way, no? If that's the case, it sounds like this PR would be our next best bet.
Yes, that's part of what I was saying. There are other options we could consider, though. For instance, it might be straightforward to identify Schur + trsyl combinations and provide gradients for those, or something similar.
Codecov Report
Merging #1020 (4fc0168) into main (14c394d) will increase coverage by 4.99%. The diff coverage is 81.15%.
:exclamation: Current head 4fc0168 differs from pull request most recent head 48965b2. Consider uploading reports for the commit 48965b2 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main #1020 +/- ##
==========================================
+ Coverage 74.10% 79.09% +4.99%
==========================================
Files 174 173 -1
Lines 48673 48554 -119
Branches 10373 10972 +599
==========================================
+ Hits 36067 38405 +2338
+ Misses 10315 7652 -2663
- Partials 2291 2497 +206
| Impacted Files | Coverage Δ | |
|---|---|---|
| aesara/tensor/slinalg.py | 83.79% <81.15%> (-0.61%) | :arrow_down: |
| aesara/link/jax/dispatch/extra_ops.py | 86.56% <0.00%> (-8.96%) | :arrow_down: |
| aesara/sparse/type.py | 81.57% <0.00%> (-2.83%) | :arrow_down: |
| aesara/link/vm.py | 90.26% <0.00%> (-2.30%) | :arrow_down: |
| aesara/link/numba/dispatch/scalar.py | 86.00% <0.00%> (-1.42%) | :arrow_down: |
| aesara/graph/fg.py | 87.39% <0.00%> (-1.15%) | :arrow_down: |
| aesara/compile/function/pfunc.py | 81.40% <0.00%> (-1.01%) | :arrow_down: |
| aesara/printing.py | 49.76% <0.00%> (-0.65%) | :arrow_down: |
| aesara/tensor/rewriting/shape.py | 79.33% <0.00%> (-0.59%) | :arrow_down: |
| aesara/tensor/extra_ops.py | 89.09% <0.00%> (-0.39%) | :arrow_down: |
| ... and 59 more | | |
I hope I don't seem like a broken record when saying this, but implementing the backward-mode gradient operations through e.g. the Schur decomposition isn't ideal.
Let's say we want to compute Lyapunov(A=2 * eye, Q=-eye), so we have $AXA^H - X + Q = 0$ or $X = \frac{1}{3}I$.
For a given $\bar{X}$ the values for $\bar{Q}$ and $\bar{A}$ are also nicely defined, and the formulas from the paper work just fine.
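(As a quick check of the stated forward value, independent of this PR:)

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = 2 * np.eye(3)
Q = -np.eye(3)
# A X A^H - X + Q = 0  =>  4 X - X - I = 0  =>  X = I / 3
X = solve_discrete_lyapunov(A, Q)
np.allclose(X, np.eye(3) / 3)
# True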
But if we compute the gradients through the Schur decomposition we run into trouble:
First we can notice that because $A$ is symmetric, the Schur decomposition coincides with the eigenvalue decomposition. But because the eigenvalues of $A$ are not unique, the function that maps $A$ to its eigenvalue decomposition isn't a well-defined function in the mathematical sense, because any orthogonal matrix $Q$ can be used as the eigenvector matrix. In the forward code this isn't an issue, because we don't care which $Q$ we get as long as it is orthogonal. However, the backward operation of the eigenvalue decomposition isn't well defined because of this, which shows up in the formulas for $\bar{A}$ as a division by zero (factors like $1/(\lambda_j - \lambda_i)$ blow up for repeated eigenvalues).
This makes me think that the Op as it is implemented in this PR is actually better than a rewrite that uses the Schur decomposition and its gradient directly. It is reasonably stable, tested, and produces values and gradients whenever they are well defined (i.e. $\lambda_i(A) \lambda_j(A) \neq 1$), and this is probably also true for most other implicitly defined functions of this kind (e.g. Sylvester, Riccati, etc.). I for one would really like to have this functionality in Aesara.
But if we want an even better solution, maybe we could get the best of both worlds: if we have forward Ops for Schur, trsyl, QZ, etc., we could write the Lyapunov et al. Ops as aesara.compile.builders.OpFromGraph or something similar. This way we could tell Aesara more about what's happening in the solver code and also optimize those expressions (for instance, the current PR would compute the Schur decomposition of $A$ twice, once in the forward code and once in the gradient, and Aesara should be smart enough to notice this and remove the redundant work from the graph during common sub-expression elimination). We could still provide custom code for the gradients, though. I'd be disappointed, though, if this were a strict requirement, nobody wanted to do the work (and I think it is quite a bit of work), and because of that we didn't get this functionality at all.
> I hope I don't seem like a broken record when saying this, but implementing the backward-mode gradient operations through e.g. the Schur decomposition isn't ideal.
I have yet to see how "expanding" the Lyapunov-solving function/operator into its constituent parts wouldn't be ideal. In other words, if it could be done, why wouldn't it be better than the un-expanded version?
So far, in this discussion, the only concrete issue I've noticed is whether or not one component of the expanded version could be implemented without more effort than it's theoretically worth, but that's distinct from the question(s) "Is it possible or ideal?".
Aside from that, I've already given a directly relevant example demonstrating how such expansions can be at least as performant as their un-expanded counterparts (i.e. the discrete "direct" case). Just so we're clear, the take-away from that example is that we should dispatch based on the method, so that the expanded version is used when possible, and, if need be, we can have an un-expanded Op for just the continuous case.
> Let's say we want to compute Lyapunov(A=2 * eye, Q=-eye), so we have $AXA^H - X + Q = 0$ or $X = \frac{1}{3}I$. For a given $\bar{X}$ the values for $\bar{Q}$ and $\bar{A}$ are also nicely defined, and the formulas from the paper work just fine. But if we compute the gradients through the Schur decomposition we run into trouble:
It sounds like you're restating one of the potential challenges behind implementing a complete gradient for the constituent Schur step, and not necessarily making a statement about the performance or quality of an expanded approach, both of which are important aspects of an ideal/not-ideal approach.
Before going further, it should be clear that we're talking about one very specific "degenerate" set of inputs for which we have not even performed any sort of analysis in the un-expanded case (i.e. the current implementation), let alone a relevant comparative analysis with an expanded implementation of any form.
That said, this is really a constrained issue, and one that might be surmountable in any number of ways, some of which could have everything/nothing to do with the specifics of an Op.grad implementation or rewriting.
This is especially relevant given that the provided example equates eigen-decompositions with Schur decompositions, which may be true in some cases at a high level, but not true in terms of the solution-generating processes and their choices regarding mathematically ambiguous mappings, subgradients, removable singularities, etc.
If we want consistent gradient implementations, then we need to be consistent with these solution-generating processes and the definitions they use, not just the high-level mathematical definitions.
> First we can notice that because $A$ is symmetric, the Schur decomposition coincides with the eigenvalue decomposition. But because the eigenvalues of $A$ are not unique, the function that maps $A$ to its eigenvalue decomposition isn't a well-defined function in the mathematical sense, because any orthogonal matrix $Q$ can be used as the eigenvector matrix.
The actual function mapping $A$ to its eigen-decomposition (i.e. numpy.linalg.eigh) appears to be well defined. You demonstrated this by showing that it chooses the Euclidean basis vectors by returning an identity matrix. Sure, the very broadly defined mathematical definition of an eigen-decomposition doesn't specify as much, but that's because it doesn't particularly serve the mathematical definition to do so. Nevertheless, the actual mappings we use are well-defined enough to specify valid eigenvectors, so that point is moot.
More importantly, consider a simpler function with a similar issue: the absolute value.
import numpy as np
import aesara
import aesara.tensor as at
x = at.scalar("x")
y = at.abs(x)
y_fn = aesara.function([x], [y, aesara.grad(y, x)])
y_fn(0.0)
# [array(0.), array(0.)]
As with the absolute value, it's possible to choose a specific subgradient value. I don't see why the same isn't possible/reasonable here.
> In the forward code this isn't an issue, because we don't care which $Q$ we get as long as it is orthogonal. However, the backward operation of the eigenvalue decomposition isn't well defined because of this, which shows up in the formulas for $\bar{A}$ as a division by zero.
I keep noticing mention of "forward" and "backward", but I don't yet see why these distinctions are relevant, so I can't address those aspects at the moment.
Regardless, let's see if we can quickly do something along the lines of the subgradient approach mentioned above, but for this exact example.
Currently, our Eigh.grad is an awkward implementation that is driven by an EighGrad Op, making it an un-expanded implementation (i.e. its gradient doesn't generate an Aesara graph comprised of other, "lower-level" Ops).
For comparison, we'll simply add the implementation from Giles, Mike. 2008. "An Extended Collection of Matrix Derivative Results for Forward and Reverse Mode Automatic Differentiation." to a custom Eig class. (Since Eig, and most other Ops in aesara.tensor.linalg, don't have Op.grad implementations, this can also serve as a test/example of some sorely needed additions.)
import numpy as np
import aesara
import aesara.tensor as at
from aesara.tensor.nlinalg import Eig, _zero_disconnected
A = at.matrix("A")
d, U = at.linalg.eigh(A)
eigh_fn = aesara.function([A], [d, U, at.grad(U.sum(), A)])
class MyEig(Eig):
    def L_op(
        self,
        inputs,
        outputs,
        g_outputs,
    ):
        (A,) = inputs
        d, U = outputs
        dd, dU = _zero_disconnected([d, U], g_outputs)
        dD = at.diag(dd)
        # Compute all differences of the elements in `d`, i.e. `d[j] - d[i]`
        E = at.outer(at.ones(d.shape[0]), d) - d[..., None]
        # This is what the current version of `Eigh.grad` effectively does:
        # from aesara.tensor.subtensor import set_subtensor
        # non_diag_mask = tm.invert(at.eye(E.shape[0], dtype="bool"))
        # F = at.zeros_like(E)
        # F = set_subtensor(F[non_diag_mask], tm.reciprocal(E[non_diag_mask]))
        # This replaces all `d[j] == d[i]` with 0, instead of just the diagonal
        # of E
        F = at.switch(at.neq(E, 0.0), at.reciprocal(E), 0.0)
        # TODO: This inverse probably isn't good.
        dA = at.linalg.inv(U).T @ (dD + F * U.T @ dU) @ U.T
        return [dA]


myeig = MyEig()
d_2, U_2 = myeig(A)
myeig_fn = aesara.function([A], [d_2, U_2, at.grad(U_2.sum(), A)])
rng = np.random.default_rng(2039)
A_val = rng.normal(size=(3, 3))
A_val = A_val.T @ A_val
A_val = A_val + A_val.T
d_val, U_val, dU_sum_val = eigh_fn(A_val)
d_val, U_val, dU_sum_val
# [array([ 1.60014544, 6.97086699, 10.3810344 ]),
# array([[ 0.60544104, -0.06363686, 0.79334198],
# [-0.65820594, -0.6004278 , 0.4541491 ],
# [-0.44744396, 0.7971429 , 0.40540979]]),
# array([[-0.0907336 , 0. , 0. ],
# [ 0.32372021, 0.14822482, 0. ],
# [-0.30375936, 0.0925893 , -0.05749122]])]
# Rough check of the eigen-decomposition
np.allclose(U_val * d_val @ np.linalg.inv(U_val), A_val)
# True
%timeit eigh_fn(A_val)
# 148 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
d_2_val, U_2_val, dU_sum_2_val = myeig_fn(A_val)
d_2_val, U_2_val, dU_sum_2_val
# [array([ 1.60014544, 10.3810344 , 6.97086699]),
# array([[-0.60544104, 0.79334198, -0.06363686],
# [ 0.65820594, 0.4541491 , -0.6004278 ],
# [ 0.44744396, 0.40540979, 0.7971429 ]]),
# array([[-0.03123565, -0.12868062, -0.41475149],
# [ 0.01339011, 0.05516287, 0.17779589],
# [-0.01800771, -0.07418586, -0.23910902]])]
np.allclose(U_2_val * d_2_val @ np.linalg.inv(U_2_val), A_val)
# True
%timeit myeig_fn(A_val)
# 97.6 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Perform some rough checks to make sure that the new gradient implementation
# for our new `Op` is working:
aesara.gradient.verify_grad(lambda A: myeig(A)[0].sum(), [A_val], rng=rng)
# How about at a (previously) "undefined" input?
aesara.gradient.verify_grad(lambda A: myeig(A)[0].sum(), [np.eye(3)], rng=rng)
# What about `Eigh`?
aesara.gradient.verify_grad(lambda A: at.linalg.eigh(A)[0].sum(), [A_val], rng=rng)
aesara.gradient.verify_grad(lambda A: at.linalg.eigh(A)[0].sum(), [np.eye(3)], rng=rng)
# ValueError: ('abs_err not finite', 'array([[nan, 0., 0.],\n [nan, nan, 0.],\n [nan, nan, nan]])')
# Now, try the degenerate case:
eigh_fn(np.eye(3))
# [array([1., 1., 1.]),
# array([[1., 0., 0.],
# [0., 1., 0.],
# [0., 0., 1.]]),
# array([[nan, 0., 0.],
# [nan, nan, 0.],
# [nan, nan, nan]])]
myeig_fn(np.eye(3))
# [array([1., 1., 1.]),
# array([[1., 0., 0.],
# [0., 1., 0.],
# [0., 0., 1.]]),
# array([[0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.]])]
Like the expanded discrete Lyapunov example, this expanded gradient implementation is at least as performant as a comparable non-expanded implementation.
While the comparison isn't exact, because Eigh makes assumptions that Eig doesn't and produces different eigenvectors, it still sufficiently illustrates the points.
> This makes me think that the Op as it is implemented in this PR is actually better than a rewrite that uses the Schur decomposition and its gradient directly. It is reasonably stable, tested, and produces values and gradients whenever they are well defined (i.e. $\lambda_i(A) \lambda_j(A) \neq 1$), and this is probably also true for most other implicitly defined functions of this kind (e.g. Sylvester, Riccati, etc.). I for one would really like to have this functionality in Aesara.
To reiterate, you've only stated one of the implementation challenges/choices, not a point regarding the relative quality of different implementations; otherwise, what you're saying appears to amount to "We have an implementation in this branch and I want to use it, so it's better than the unimplemented approaches we're discussing". It's fine to say you want this (or any) implementation, but it doesn't make sense to use that as a means of comparing their relevant qualities (aside from being present or not).
Also, functions are only as well-defined as they're implemented/specified: i.e. the fact that a common mathematical definition of a function isn't well-defined at certain points doesn't mean that a viable, refined definition of it isn't possible; however, you seem to be implying the latter.
@aseyboldt neatly summarized the possible approaches and their pros/cons in a private conversation, and one of the points he made was that what we're calling the "subgradient approach" (i.e. making "usable" Op.grad choices for degenerate values) is not likely to work as generally or easily as we'd want.
I absolutely agree that the OpFromGraph approach is probably our best bet for retaining the potential performance and implementation benefits of this PR's unexpanded approach and an Op-expanded approach.
My summary from that chat (slightly edited, I hope I didn't break anything @brandonwillard):
We know that if we split the solve Op and then chain the backward Ops of the computation graph of solve_lyapunov (i.e. the Schur decomposition and trsyl), we get additional singularities that we don't want: the Lyapunov equation always has a solution and gradient unless $\lambda_i(A) \lambda_j(A) = 1$ for some $i, j$.
But using the derivatives of the Schur Ops etc. would lead to singularities if $\lambda_i(A) = \lambda_j(A)$ for some $i, j$, and since A could have several zero eigenvalues in some applications, this wouldn't be great.
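A tiny NumPy illustration of the two conditions, using the $A = 2I$ example from above (nothing Aesara-specific, just to make the distinction concrete):

import numpy as np

lam = np.linalg.eigvals(2 * np.eye(3))  # eigenvalues of A: [2, 2, 2]

# Lyapunov gradient condition: singular only if lambda_i * lambda_j == 1
np.any(np.isclose(np.outer(lam, lam), 1.0))
# False -> the Lyapunov gradient is well defined

# Chained Schur/eigen gradient condition: singular if lambda_i == lambda_j for i != j
diffs = lam[:, None] - lam[None, :]
off_diag = ~np.eye(3, dtype=bool)
np.any(np.isclose(diffs[off_diag], 0.0))
# True -> division by zero in the chained backward pass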
We know of these 5 options:
- Just ignore the problem and hope nobody cares about those A matrices.
- Use a single custom Op with a custom gradient implementation to directly use the manually computed backward op (this PR).
- Split the Lyapunov Op using OpFromGraph and override the gradient there. (With or without the gradient implementations for Schur and trsyl; we wouldn't need those for this solution, even though of course they'd still be great to have.)
- Write a graph rewrite that transforms Ops that look like they have those singularities.
- Try to figure out definitions for the singularities that make things work anyway.
And my current personal feelings about those would be something like this:
- About 1: I don't like ignoring the problem.
- About 2: This clearly isn't perfect, but I'd be fine with it, at least as long as nobody actually does the work for one of the others. We unfortunately end up doing the Schur decomposition of A twice as well, so this should be significantly slower in some cases.
- About 3: I think this is my favorite, because it is pretty close to the math, more or less straightforward to implement, and relatively easy to verify. But of course it is specific to this function and doesn't generalize.
- About 4: I'd like this as well, but graph rewrites generally scare me a bit, especially if they are complicated like this one might be, although I'd like to be pleasantly surprised by a nice rewrite that does this. I think I prefer to just generate a good graph in the first place where possible.
- About 5: I wouldn't really know where to start, and it sounds scary to me, but if it works I think this would also be nice.
This turned out to be trickier than I thought; I hope we didn't scare you away @jessegrabowski :-)
Not scared away, just staying quiet to avoid making myself look like a fool (a common occurrence).
Some questions:
1. It seems the consensus is using an OpFromGraph then "overwriting the gradients"? I'm familiar with OpFromGraph but not with the ability to overwrite gradients. Is there another function somewhere that does this that I could consult as a reference?
2. Where I was planning to go was to follow @brandonwillard's direct solver implementation, and dispatch either to that or to the continuous solver based on the size of the A matrix. This would be an intermediate step, until a solution for the Schur step is found. Is this consistent with (1)?
3. It also seems it might be necessary to implement some LAPACK functions directly (inferring from #1030). trsyl keeps coming up, but so does gees implicitly (the Schur solver). Are there some examples of COps I can look at for reference here? I will post some more questions directly in that issue, since I imagine these will end up overlapping.
Plus a comment:
- On the connection between Schur and QR, I reached out to some experts in an effort to get a better understanding of the link here. Evidently QR can be applied blockwise as a solution algorithm for Schur, but there is no "simple" one-to-one computational correspondence between the two. For example, gees doesn't directly call one of the QR solvers (geqrf, geqp3, etc.), which I was really hoping it would. I am going to keep reaching out to researchers and I'll let you know if I come up with anything.
> 1. It seems the consensus is using an OpFromGraph then "overwriting the gradients"? I'm familiar with OpFromGraph but not with the ability to overwrite gradients. Is there another function somewhere that does this that I could consult as a reference?
You can define a function for the gradient of an OpFromGraph instead of relying on the autodiff of the internal graph. See example 3 here: https://aesara.readthedocs.io/en/latest/library/compile/opfromgraph.html?highlight=OpFromGraph#opfromgraph
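To make sure we're talking about the same mechanism, here is a toy sketch of that pattern using grad_overrides (the scalar function is made up; for the Lyapunov case, the inner graph would be the solver and the override would implement the paper's adjoint equations):

import aesara
import aesara.tensor as at
from aesara.compile.builders import OpFromGraph

x, y = at.scalars("x", "y")
z = x ** 2 + y  # the "forward" graph we want to wrap


# A custom gradient used instead of autodiff on the inner graph.
# (Here it happens to agree with autodiff; for the Lyapunov Op it would be
# the hand-derived reverse-mode expressions.)
def custom_grad(inputs, output_grads):
    x, y = inputs
    (g,) = output_grads
    return [2 * x * g, g]


f = OpFromGraph([x, y], [z], grad_overrides=custom_grad)

a, b = at.scalars("a", "b")
grad_fn = aesara.function([a, b], at.grad(f(a, b), [a, b]))
grad_fn(3.0, 1.0)
# [array(6.), array(1.)]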
> 2. This would be an intermediate step, until a solution for the Schur step is found. Is this consistent with (1)?
Yes, I believe so.
> 3. Are there some examples of COps I can look at for reference here?
aesara.tensor.blas contains a lot of the COps that use BLAS at the C level; however, most/all of them are subclasses of the same GemmRelated class, so they're not particularly good design examples.
aesara.scalar.math has a wide array of simple COp.c_code implementations that are probably worth looking at. As long as there's a BLAS/LAPACK-containing library that can be found by the linker, something not far from those simple implementations can be used, along with some COp.[c_libraries, c_compile_args, c_lib_dirs, c_header_dirs] overrides that make the compiler aware of the required external libraries.
Also, if NumPy/SciPy expose their own C-level means of accessing BLAS/LAPACK functions, then one can always use those.
I updated this PR with the following changes:
- The direct solver for the discrete Lyapunov case is now written in Aesara, based on the implementation by @brandonwillard presented above.
- A check for complex objects has been added to avoid using the .conj() method when it is not needed. This should provide gradients in the case that all matrices are real, which I think is the most common use case.
- None is no longer an accepted value for the method parameter in solve_discrete_lyapunov. The default is direct, which calls the pure Aesara implementation (a usage sketch follows below).
- The docstring on solve_discrete_lyapunov has been updated to explain that direct should be preferred in all cases, except when N is large, or when inputs are complex and gradients are required.
- Two additional test cases were added, covering real and complex inputs to the new _solve_discrete_lyapunov_direct function.
- The discrete Lyapunov Op with gradients implemented "by hand" has been renamed to SolveDiscreteLyapunovBilinear, and it no longer takes a method parameter on initialization.
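For clarity, usage after this change looks roughly like the following (the "bilinear" keyword value shown here is an assumption inferred from the renamed Op; only the direct default is stated above):

import aesara
import aesara.tensor as at
from aesara.tensor.slinalg import solve_discrete_lyapunov

A = at.dmatrix("A")
Q = at.dmatrix("Q")

# Default: the pure-Aesara direct (vectorized) solver
X_direct = solve_discrete_lyapunov(A, Q)

# Assumed keyword value dispatching to the SciPy-backed Op with hand-written gradients
X_bilinear = solve_discrete_lyapunov(A, Q, method="bilinear")

fn = aesara.function([A, Q], [X_direct, at.grad(X_direct.sum(), A)])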
My idea with this setup is to direct users to the Aesara implementation, while leaving the "by hand" version in place until we can cover all the corner cases: complex gradients and a native Aesara implementation of the continuous case. If the consensus is to remove it altogether, though, I can do that.
I think I messed this all up trying to update the branch to match the current Aesara main and resolve the conflicts; maybe I should delete it and try again?
> I think I messed this all up trying to update the branch to match the current Aesara main and resolve the conflicts; maybe I should delete it and try again?
One minute; I'll take a (local) look.
Thanks for taking the time to clean up my mess. I'll take some time to get my git situation squared away before I move forward with any other contributions.
If it helps anyone, I am also posting here. For reference, I have a background in applied linear algebra. I just wanted to add that the Schur form is numerically computed using the QR step algorithm (I prefer this name over "QR algorithm"), which is somewhat related to power iteration, in its subspace flavor. This is done efficiently with shifts and implicit bulge chasing, which is usually out of scope for an implementation from scratch; at the very least, it does not make sense to do so, since the LAPACK version is available and it is hard to do better than that. An additional complication arises if one wants the real Schur form, the upper quasi-triangular matrix with 1-by-1 and 2-by-2 blocks, where the 2-by-2 blocks correspond to complex-conjugate eigenvalue pairs.
C++ code that computes gradients directly from the decompositions and their properties, in a projective-gradient-like way, is available for example in: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/FunctionsManual.cpp The other option would be to backprop through the steps of the algorithm directly, but that requires a numerically stable reimplementation with autodiff support, and also ensuring that the gradient itself is numerically stable, which sounds hard. I think the Schur decomposition is not in there, but following the documented code and the resources in its comments, it should be possible to produce a formula for the eigenvalue gradients if one really desired a more stable formula.
Sorry for the delay in reviewing this. It looks like Brandon approved this PR, so unless the tests fail after rebasing on main, I will be merging this. Thank you for your contribution, and your patience!