`cholesky_factor_corr[n]` reliably fails to initialize (from [-2,+2]) for "moderate" `n`
Using the `cholesky_factor_corr[n]` parameter type reliably causes initialization to fail from around `n = 40`. This happens because, for high enough `n`, when the unconstrained draws are sampled uniformly from [-2,+2], some of the diagonal terms of the constrained parameter round down to zero; compare https://github.com/stan-dev/math/blob/3a196d410415ee2cc56c0e51ec73a2e1970a08a4/stan/math/prim/constraint/cholesky_corr_constrain.hpp#L74
However, once sampling has started, it usually does not fail for the same reason. What "saves" regular sampling is the shrinking scale of the marginals of "reasonable" priors (and probably also posteriors), which prevents the sampler from moving into the combinations of parameter values that are frequent during initialization but lead to the rounding-down-to-zero issue.
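To see the failure mode concretely, here is a minimal NumPy sketch (an illustration, not Stan's actual C++ code) of the row-wise recursion used by `cholesky_corr_constrain`, applied to the last row of a factor with `n = 40` and uniform [-2,+2] unconstrained draws:

```python
import numpy as np

def row_constrain(y):
    """Constrain one row of unconstrained values y (length i-1) following the
    same recursion as Stan's cholesky_corr_constrain: each entry is tanh(y_j)
    scaled by the norm left over from the previous entries. Returns the
    diagonal entry sqrt(1 - sum of squares)."""
    sum_sqs = 0.0
    for yj in y:
        x = np.tanh(yj) * np.sqrt(1.0 - sum_sqs)
        sum_sqs += x * x
    return np.sqrt(max(1.0 - sum_sqs, 0.0))

rng = np.random.default_rng(0)
n = 40
# Stan initializes unconstrained parameters uniformly on [-2, 2];
# the last row of the factor has n - 1 off-diagonal entries.
diags = [row_constrain(rng.uniform(-2.0, 2.0, size=n - 1)) for _ in range(100)]
print(min(diags))
```

For rows this long, the accumulated sum of squares frequently rounds up to exactly 1 in double precision, so the diagonal entry `sqrt(1 - sum_sqs)` becomes exactly zero and the log-Jacobian term diverges.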
In fact, it appears (but don't ask me why) that for a `cholesky_factor_corr[n] factor ~ lkj_corr_cholesky(1.);` parameter, the unconstrained draw corresponding to the `factor[i,j]` entry has marginal prior scale `sqrt(n-j)`.
A simple fix for the initialization issue is to prescale the unconstrained parameters before feeding them into the existing constraining transform. Prescaling such that the marginal scales of the unconstrained parameters are all 1 allows me to "easily" sample from priors with `n = 64` or `n = 128`, which is as far as I wanted to test it right now.
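The prescaling fix can be sketched the same way; the only change (under the `sqrt(n-j)` scale assumption stated above) is dividing the unconstrained draw for column `j` by `sqrt(n - j)` before applying `tanh`. Even for the worst-case last row with `n = 128`, the diagonal stays bounded away from zero:

```python
import numpy as np

def row_diag(y, scales):
    """Diagonal entry of one constrained row, as in the recursion of Stan's
    cholesky_corr_constrain, but with each unconstrained value pre-divided
    by its column-dependent scale."""
    sum_sqs = 0.0
    for yj, s in zip(y, scales):
        x = np.tanh(yj / s) * np.sqrt(1.0 - sum_sqs)
        sum_sqs += x * x
    return np.sqrt(max(1.0 - sum_sqs, 0.0))

rng = np.random.default_rng(0)
n = 128
i = n  # last (longest, hence worst-case) row
scales = np.sqrt(n - np.arange(1, i))  # sqrt(n - j) for columns j = 1..i-1
diags = [row_diag(rng.uniform(-2.0, 2.0, size=i - 1), scales) for _ in range(100)]
print(min(diags))  # stays strictly positive
```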
Attaching a Stan file with a sample implementation below. It also includes attempts to implement the constraining transform on the log scale, which may or may not have been necessary. The only "conceptual" changes to the original constraining transform are the scalings of the form `... / sqrt(...)`.
```stan
functions {
  real log_abs_tanh(real x) {
    real z = -2 * abs(x);
    return log1m_exp(z) - log1p_exp(z);
  }
  real log_square_tanh(real x) {
    return 2 * log_abs_tanh(x);
  }
  real fused_lkj_corr_cholesky_lpdf(vector xi, real eta, int n) {
    // Combines
    // * https://mc-stan.org/docs/functions-reference/correlation_matrix_distributions.html#probability-density-function-1
    // * https://github.com/stan-dev/math/blob/3a196d410415ee2cc56c0e51ec73a2e1970a08a4/stan/math/prim/constraint/cholesky_corr_constrain.hpp
    // * https://github.com/stan-dev/math/blob/3a196d410415ee2cc56c0e51ec73a2e1970a08a4/stan/math/prim/constraint/corr_constrain.hpp
    // * https://www.wolframalpha.com/input?i=Log%28Tanh%28x%29+%5E2%29
    real rv = 0.;
    int xii = 1;
    for (i in 2:n) {
      // log of the running sum of squares of the constrained row entries
      real log_sos = log_square_tanh(xi[xii] / sqrt(n - 1));
      rv += log1m_exp(log_sos);
      xii += 1;
      for (j in 2:(i - 1)) {
        rv += .5 * log1m_exp(log_sos);
        real tmp = log_square_tanh(xi[xii] / sqrt(n - j));
        rv += log1m_exp(tmp);
        log_sos = log_sum_exp(log_sos, tmp + log1m_exp(log_sos));
        xii += 1;
      }
      rv += (n - i + 2 * eta - 2) * .5 * log1m_exp(log_sos);
    }
    return rv;
  }
}
data {
  int n;
  real obs;
}
parameters {
  vector[(n * (n - 1)) / 2] lkj;
}
model {
  lkj ~ fused_lkj_corr_cholesky(1.0, n);
}
```
Hi Nico,
Yes, this is known behavior; we have a few Discourse posts about it, but no math issue until now! Thanks for opening it and for the suggestion. I have 3-4 other parameterizations that don't show this issue, and I'd like to do a bake-off to see which one is generally preferable. Here are a couple of them; I have two others that I need to clean up. In addition, I'd like to add positive-only correlation matrices, which I have a parameterization for and presented at StanCon.
```stan
matrix cholesky_corr_constrain_tfp_lp(vector y, int K) {
  matrix[K, K] L = identity_matrix(K);
  int counter = 1;
  real s;
  for (i in 2:K) {
    L[i, 1:i - 1] = y[counter:counter + i - 2]';
    counter += i - 1;
    s = norm2(L[i, :i]);
    L[i, :i] = L[i, :i] / s;
    target += -(i + 1) * log(s);
  }
  return L;
}

matrix cholesky_corr_constrain_proposal_lp(vector y, int K) {
  matrix[K, K] L = identity_matrix(K);
  int counter = 1;
  for (i in 2:K) {
    row_vector[i - 1] y_star = y[counter:counter + i - 2]';
    real dsy = dot_self(y_star);
    real alpha_r = 1 / (dsy + 1);
    real gamma = sqrt(dsy + 2) * alpha_r;
    L[i, :i] = append_col(gamma * y_star, alpha_r);
    target += 0.5 * (i - 2) * log(dsy + 2) - i * log1p(dsy);
    counter += i - 1;
  }
  return L;
}
```
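As a quick plausibility check (not part of the proposed Stan code), here are NumPy ports of the two transforms above with the Jacobian accumulation omitted; both produce lower-triangular factors with unit-norm rows and positive diagonals, i.e. valid correlation Cholesky factors:

```python
import numpy as np

def chol_corr_tfp(y, K):
    """NumPy port of cholesky_corr_constrain_tfp_lp (Jacobian terms omitted):
    fill each row's off-diagonal from y, keep the diagonal 1 from the
    identity, then normalize the row to unit norm."""
    L = np.eye(K)
    c = 0
    for i in range(1, K):
        L[i, :i] = y[c:c + i]
        c += i
        L[i, :i + 1] /= np.linalg.norm(L[i, :i + 1])
    return L

def chol_corr_proposal(y, K):
    """NumPy port of cholesky_corr_constrain_proposal_lp (Jacobian omitted)."""
    L = np.eye(K)
    c = 0
    for i in range(1, K):
        y_star = y[c:c + i]
        dsy = y_star @ y_star
        alpha_r = 1.0 / (dsy + 1.0)
        gamma = np.sqrt(dsy + 2.0) * alpha_r
        L[i, :i + 1] = np.concatenate([gamma * y_star, [alpha_r]])
        c += i
    return L

rng = np.random.default_rng(1)
K = 5
y = rng.normal(size=K * (K - 1) // 2)
for f in (chol_corr_tfp, chol_corr_proposal):
    L = f(y, K)
    # L @ L.T should be a correlation matrix: unit diagonal, and L should
    # be lower triangular with a positive diagonal.
    print(np.allclose(np.diag(L @ L.T), 1.0), bool(np.all(np.diag(L) > 0)))
```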
The diagonal should be distributed something like

$$ L^2_{ii} \sim \text{Beta}\left(\alpha = \frac{K - i}{2} + \eta,\ \beta = \frac{i - 1}{2}\right) $$
I think there are a few ways to derive this, and Seth Axen worked this out using directional-statistics methods.
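A quick Monte Carlo sanity check for the smallest nontrivial case: with K = 2 and eta = 1, the LKJ prior makes the correlation r uniform on (-1, 1), so L_22^2 = 1 - r^2, and the formula above predicts Beta(alpha = 1, beta = 1/2), whose mean is 2/3 and variance 4/45:

```python
import numpy as np

# For K = 2, eta = 1, the LKJ density is proportional to det(R)^(eta-1) = 1,
# so r is uniform on (-1, 1) and L[2,2]^2 = 1 - r^2. Beta(1, 1/2) has
# mean alpha / (alpha + beta) = 2/3 and
# variance alpha*beta / ((alpha+beta)^2 * (alpha+beta+1)) = 4/45.
rng = np.random.default_rng(0)
r = rng.uniform(-1.0, 1.0, size=1_000_000)
l22_sq = 1.0 - r * r
print(l22_sq.mean())  # close to 2/3
print(l22_sq.var())   # close to 4/45 ~= 0.0889
```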
> The diagonal should be distributed something like [...]
Cool cool! Based on things working so well with the sqrt scaling, I suspected that there must be a (potentially complicated or straightforward) way to derive the marginal distributions 👍