add Givens-based orthonormal matrix transform and inverse plus Jacobian
Description
We would like to have an orthonormal matrix type for constrained variables in Stan. That requires defining the transform and its inverse plus log Jacobian determinant in the math library first.
Among the applications are factor models and parameterizations of positive definite matrices in terms of eigenvectors.
This would amount to a data type orthonormal[P, Q] in Stan and three functions in the math library, which would have roughly the signatures below (see, e.g., corr_matrix_free.hpp and corr_matrix_constrain.hpp in the directory stan/math/prim/fun). This amounts to two files with a total of three function signatures.
orthonormal_free.hpp
Eigen::Matrix<T, 1, -1> orthonormal_free(const Eigen::Matrix<T, -1, -1>& y);
orthonormal_constrain.hpp
// without Jacobian adjustment
Eigen::Matrix<T, -1, -1> orthonormal_constrain(const Eigen::Matrix<T, -1, 1>& x, int P, int Q);
// with Jacobian adjustment
Eigen::Matrix<T, -1, -1> orthonormal_constrain(const Eigen::Matrix<T, -1, 1>& x, int P, int Q, T& lp);
Example
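To make the intended behavior concrete, here is a minimal self-contained sketch of the constraining direction in plain Eigen. The function name is hypothetical (nothing like it exists in the math library yet), and the rotation order and sign conventions would need to be checked against Algorithm 1 of the paper referenced below; the orthonormality check at the end holds for any ordering of the rotations.

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Hypothetical sketch: map P*Q - Q*(Q+1)/2 Givens angles to a P x Q matrix
// with orthonormal columns by applying plane rotations to the first Q
// columns of I_P.  Rotation order and signs here are illustrative; a real
// implementation must match Algorithm 1 of Pourzanjani et al.
Eigen::MatrixXd orthonormal_constrain_sketch(const Eigen::VectorXd& theta,
                                             int P, int Q) {
  Eigen::MatrixXd Y = Eigen::MatrixXd::Identity(P, Q);
  int k = theta.size();
  for (int i = Q - 1; i >= 0; --i) {
    for (int j = P - 1; j > i; --j) {
      double t = theta(--k);
      double c = std::cos(t), s = std::sin(t);
      Eigen::RowVectorXd ri = Y.row(i), rj = Y.row(j);
      Y.row(i) = c * ri - s * rj;  // rotate rows i and j in their plane
      Y.row(j) = s * ri + c * rj;
    }
  }
  return Y;
}

int main() {
  int P = 5, Q = 2;
  int D = P * Q - Q * (Q + 1) / 2;  // dimension of the Stiefel manifold
  Eigen::VectorXd theta = Eigen::VectorXd::Random(D);
  Eigen::MatrixXd Y = orthonormal_constrain_sketch(theta, P, Q);
  std::cout << "Y^T Y =\n" << Y.transpose() * Y << "\n";  // ~ I_Q
}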
Reference
We can follow the code and transforms laid out here:
Arya A. Pourzanjani, Richard M. Jiang, Brian Mitchell, Paul J. Atzberger, Linda R. Petzold. 2021. Bayesian Inference over the Stiefel Manifold via the Givens Representation. Bayesian Analysis.
Derivative reduction
Algorithm 1 has the constraining transform, where angles are defined by
theta[i, j] = atan2(y[i, j], y[i, i]).
Equation (4.5) lists the Jacobian adjustment, which involves terms that on the log scale are equal to
(j - i - 1) * log(cos(theta[i, j])).
We can exploit the fact that
d/du log(cos(u)) = -tan(u)
and that the definition of theta[i, j] gives tan(theta[i, j]) = y[i, j] / y[i, i], which reduces the derivative to
d/d theta[i, j] (j - i - 1) * log(cos(theta[i, j])) = -(j - i - 1) * y[i, j] / y[i, i].
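A quick standalone check of that reduction against a central finite difference (illustrative, not library code):

#include <cmath>
#include <cstdio>

int main() {
  // Verify d/dtheta [(j - i - 1) * log(cos(theta))] == -(j - i - 1) * tan(theta)
  // by central finite difference at an arbitrary angle.
  double theta = 0.3;
  double m = 4;  // stands in for (j - i - 1)
  double h = 1e-6;
  double fd = m * (std::log(std::cos(theta + h))
                   - std::log(std::cos(theta - h))) / (2 * h);
  double analytic = -m * std::tan(theta);
  std::printf("finite difference %.8f vs analytic %.8f\n", fd, analytic);
}

Since tan(theta[i, j]) = y[i, j] / y[i, i] by definition, the analytic gradient can be computed directly from y without any trig calls.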
Auxiliary variable approach
What I don't see how to do immediately is deal with the overparameterization and unit normal "prior," which make Pourzanjani et al.'s approach feel very much like Marsaglia's approach to unit vectors that we currently employ.
Unconstraining transform
We also need the transform to go from the constrained to the unconstrained spaces (for initialization). I couldn't find it in the paper.
Original GitHub repo
More details can probably be extracted from Pourzanjani's GitHub repo, TfRotationPca.
Relation to unit vector coding
We should be able to code unit vectors as (p, 1) orthonormal matrices. The two approaches with auxiliary variables are closely related, and may be identical.
Other repos
We also need to add the doc for the transform to the reference manual (docs repo) and the actual data type to the language and code generator (stanc3 repo).
Unit and negative unit determinants
@bgoodri and Pourzanjani were discussing the fact that the Givens approach leads to solutions whose determinants can be +1 or -1. Maybe we could constrain the determinant to +1 in the prior, but in higher dimensions, this turns into combinatorial multimodality.
Alternatives
There is more than one way to do this. Pourzanjani's paper considers the geodesic Monte Carlo (GMC) approach of Byrne and Girolami. It's also possible to use Cayley transforms.
Expected Output
Invertible transforms with Jacobians so we can sample uniformly over Stiefel manifolds.
Current Version:
v4.1.0
Unconstraining transform: We also need the transform to go from the constrained to the unconstrained spaces (for initialization). I couldn't find it in the paper.
I don't think this is technically invertible on the entire real line. Just like the unit_vector approach isn't really a bijection since the origin is left out (see https://math.stackexchange.com/a/2629480).
@betanalpha commented on the Pourzanjani paper on discourse:
This method does not fully parameterize the space of orthogonal matrices. It instead defines a chart over a Stiefel manifold that will necessarily always have a slice of parameters that it cannot describe (i.e., the chart is not isomorphic to the entire space). All of the issues with reflections and reorientations are consequences of this.
from https://discourse.mc-stan.org/t/new-transform-for-orthonormal-matrices-in-stan/2981/2
Is there a clever way to ignore these spaces for invertibility defined only on the subset?
What we do with the unit vector coding is to make the transform f (constrained to unconstrained) one to one and the inverse transform f_inv many to one, making sure that f_inv(f(x)) = x for constrained x.
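To make the one-to-one/many-to-one distinction concrete, here is a toy Eigen sketch of that unit-vector pattern (f and f_inv as named above; illustrative only, not the library implementation):

#include <Eigen/Dense>
#include <iostream>

// f embeds the constrained (unit) vector in R^n unchanged: one to one.
Eigen::VectorXd f(const Eigen::VectorXd& x) { return x; }

// f_inv projects any nonzero y back to the sphere: many to one.
Eigen::VectorXd f_inv(const Eigen::VectorXd& y) { return y.normalized(); }

int main() {
  Eigen::VectorXd x = Eigen::VectorXd::Random(3).normalized();  // constrained
  std::cout << "f_inv(f(x)) error: " << (f_inv(f(x)) - x).norm() << "\n";  // ~0
  // Many to one: every positive multiple of f(x) has the same image.
  std::cout << "f_inv(7 * f(x)) error: " << (f_inv(7.0 * f(x)) - x).norm() << "\n";
}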
Upon further review, can this method generate square orthogonal matrices with negative determinants? I can't seem to generate one when doing this in R:
rstan::expose_stan_functions("https://raw.githubusercontent.com/pourzanj/TfRotationPca/master/Stan/UniformStiefel.stan")
sign_det <- 1
counter <- 1
while (sign_det == 1) {
  theta <- runif(3, min = -pi, max = pi)
  W <- area_form_lp(theta, n = 3L, p = 3L)
  sign_det <- sign(det(W))
  counter <- counter + 1
}
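If area_form_lp builds W purely from products of Givens rotations, as in the paper, that is what the algebra predicts:
det R[i, j](theta) = cos(theta)^2 + sin(theta)^2 = 1,
so any product of such rotations applied to the identity has determinant +1, and the loop above would never terminate.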
What we do with the unit vector coding is to make the transform f (constrained to unconstrained) one to one and the inverse transform f_inv many to one, making sure that f_inv(f(x)) = x for constrained x.
Yea, my point is more semantics rather than practical. I don't think it's called a bijection since it cannot be one-to-one. There's more at https://discourse.mc-stan.org/t/divergence-treedepth-issues-with-unit-vector/8059. In fact that thread suggests using a normal(0, 0.1) rather than standard normal for the unit vector transform.
I think this rather subtle point should be made more clear in the docs for unit vector. Similarly, I'm curious when this Givens-based orthonormal transform shows divergences or treedepth issues, which we can document as well.
Here's stan code for the Cayley method https://github.com/michaeljauch/cayley/blob/master/stiefel_unif.stan. I'm not sure if it's correct. The corresponding paper is https://arxiv.org/abs/1810.02881.
Yea, my point is more semantics rather than practical. I don't think it's called a bijection since it cannot be one-to-one.
Correct, the “embedding methods” are not bijections and we really shouldn’t be using the same naming for them, including in the code.
The embedding space to embedded surface is a projection/surjection, and the inverse “lifting” map isn’t uniquely well-defined.
I think this rather subtle point should be made more clear in the docs for unit vector. Similarly, I'm curious when this Givens-based orthonormal transform shows divergences or treedepth issues, which we can document as well.
If I remember correctly someone experimented with a proper embedding approach in the Forums (not the Givens or Cayley local charts but an actual embedding like the unit vector implementation) but was plagued by intense divergences and other fitting problems.
The problem in all of these approaches is that the data will inform posterior concentration only on the embedded surface, with no constraints on the behavior transverse to the embedding surface. Only the default prior that is implicitly added affects those directions, and the intersection of these two constraints can lead to a lot of cone-like geometries that diverge at the tip.
Thanks, @betanalpha, that's all really helpful.
my point is more semantics rather than practical.
I agree. We shouldn't call a non-bijective function a bijection! I think TensorFlow Probability is the one that introduced the "bijector" data structure to encapsulate a transform, an inverse defined over the range of the transform (and possibly extended to a larger domain, hence the non-bijectiveness), and the log absolute Jacobian determinant of the inverse.
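As a toy illustration of that bundle (hypothetical names, not TensorFlow Probability's actual API), for the log/exp pair:

#include <cmath>
#include <cstdio>

// Toy sketch of the "bijector" bundle described above: a forward transform,
// an inverse defined on the forward map's range, and the log absolute
// Jacobian determinant of the inverse.  Names are hypothetical.
struct ExpBijector {
  double forward(double x) const { return std::exp(x); }   // R -> (0, inf)
  double inverse(double y) const { return std::log(y); }   // (0, inf) -> R
  // d/dy log(y) = 1/y, so log |J_inverse| = -log(y) for y > 0.
  double inverse_log_abs_det_jacobian(double y) const { return -std::log(y); }
};

int main() {
  ExpBijector b;
  double x = 1.7;
  std::printf("round trip: %.6f\n", b.inverse(b.forward(x)));           // 1.7
  std::printf("log |J|: %.6f\n", b.inverse_log_abs_det_jacobian(5.0));  // -log 5
}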
In fact that thread suggests using a normal(0, 0.1) rather than standard normal for the unit vector transform.
I take it that makes the actual parameters more unit scale?
I think this rather subtle point should be made more clear in the docs for unit vector
@spinkney If you could open a docs issue with a suggestion on which docs to fix (reference manual or user's guide or both) and what you'd like to see, I can write it.
And thanks for the reference to the other embeddings.
This kind of thing is why we're thinking about adding user-defined transforms to the language. The only question there is how to make the result clearer than spreading transform logic over three blocks as we currently do.
This kind of thing is why we're thinking about adding user-defined transforms to the language. The only question there is how to make the result clearer than spreading transform logic over three blocks as we currently do.
In my opinion we make the users implement the transformation themselves, which they can already do:
parameters { unconstrained_type unconstrained_name; }
transformed parameters { constrained_type constrained_name = constraining_map(unconstrained_name); }
Zero ambiguity, and the constraining map implemented wherever it's most natural in the Stan program.
We can consider implementing and exploring maps with more direct names, for example "pos_real_to_real" instead of "log", but that introduces the problem that there are many maps from the positive reals to the reals other than the log. Users have to know why they're using log instead of any of the other infinite possibilities, although this is already what the language largely forces users to do.
Personally I think much of the problem here is doing too much to abstract important model structure from users, which just facilitates unconsidered defaults.