
`cholesky_factor_corr[n]` reliably fails to initialize (from [-2,+2]) for "moderate" `n`

nsiccha opened this issue 6 months ago · 3 comments

Using the `cholesky_factor_corr[n]` parameter type reliably causes initialization to fail from around n=40. This happens because, for large enough n, when the unconstrained draws are sampled uniformly from [-2,+2], some of the diagonal terms of the constrained parameter round down to zero; compare https://github.com/stan-dev/math/blob/3a196d410415ee2cc56c0e51ec73a2e1970a08a4/stan/math/prim/constraint/cholesky_corr_constrain.hpp#L74
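To illustrate the rounding issue, here is a Python sketch of one row of the tanh-based recursion used by `cholesky_corr_constrain` (an illustrative reimplementation, not the actual C++): with every unconstrained draw at the edge of [-2,+2], the accumulated sum of squares rounds to exactly 1 after a few dozen entries, so the diagonal entry `sqrt(1 - sum_sqs)` becomes exactly zero.

```python
import math

def last_diagonal(y):
    # Mimics one row of cholesky_corr_constrain: z_j = tanh(y_j), each
    # off-diagonal entry scaled by the square root of the remaining mass;
    # the row's diagonal entry is sqrt(1 - sum of squared off-diagonals).
    sum_sqs = 0.0
    for yj in y:
        x = math.tanh(yj) * math.sqrt(1.0 - sum_sqs)
        sum_sqs += x * x
    return math.sqrt(1.0 - sum_sqs)

# Worst-case initialization draw: every unconstrained value at the edge +2.
print(last_diagonal([2.0] * 5))   # still positive for a short row
print(last_diagonal([2.0] * 40))  # rounds down to exactly 0.0
```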

However, once sampling has started, it usually does not fail for the same reason. What "saves" regular sampling is the shrinking scale of the marginals of "reasonable" priors (and probably also posteriors), which prevents the sampler from entering combinations of parameter values that are frequent during initialization but lead to the rounding-down-to-zero issue.

In fact, it appears (though don't ask me why) that the marginal prior scales of the unconstrained draws of a `cholesky_factor_corr[n] factor ~ lkj_corr_cholesky(1.);` parameter scale such that the unconstrained parameter corresponding to the `factor[i,j]` entry has marginal prior scale sqrt(n-j).

A simple fix for the initialization issue is to prescale the unconstrained parameters before feeding them into the existing constraining transform. Choosing the prescaling such that the marginal scales of the unconstrained parameters are all 1 allows me to "easily" sample from priors with n=64 or n=128, which is as far as I wanted to test it for now.

Attaching a Stan file with a sample implementation below. It also includes an attempt to implement the constraining transform on the log scale, which may or may not have been necessary. The only "conceptual" change to the original constraining transform is the scaling by terms of the form `... / sqrt(...)`.
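The effect of the prescaling can be sketched numerically in Python (same illustrative row recursion as before, with the j-th unconstrained value divided by sqrt(n - j) before the tanh, as in the attached Stan file; function name is mine):

```python
import math

def last_diagonal_prescaled(y, n):
    # Same row recursion as cholesky_corr_constrain, but the j-th
    # unconstrained value is divided by sqrt(n - j) before the tanh.
    sum_sqs = 0.0
    for j, yj in enumerate(y, start=1):
        x = math.tanh(yj / math.sqrt(n - j)) * math.sqrt(1.0 - sum_sqs)
        sum_sqs += x * x
    return math.sqrt(1.0 - sum_sqs)

# Worst-case initialization draw for the last row of an n = 128 factor:
# with the prescaling, the diagonal entry stays well away from zero.
n = 128
print(last_diagonal_prescaled([2.0] * (n - 1), n))
```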


```stan
functions {
  real fused_lkj_corr_cholesky_lpdf(vector xi, real eta, int n) {
    // Combines
    // * https://mc-stan.org/docs/functions-reference/correlation_matrix_distributions.html#probability-density-function-1
    // * https://github.com/stan-dev/math/blob/3a196d410415ee2cc56c0e51ec73a2e1970a08a4/stan/math/prim/constraint/cholesky_corr_constrain.hpp
    // * https://github.com/stan-dev/math/blob/3a196d410415ee2cc56c0e51ec73a2e1970a08a4/stan/math/prim/constraint/corr_constrain.hpp
    // * https://www.wolframalpha.com/input?i=Log%28Tanh%28x%29+%5E2%29
    real rv = 0.;
    int xii = 1;
    for (i in 2:n) {
      real log_sos = log_square_tanh(xi[xii] / sqrt(n - 1));
      rv += log1m_exp(log_sos);
      xii += 1;
      for (j in 2:i - 1) {
        rv += .5 * log1m_exp(log_sos);
        real tmp = log_square_tanh(xi[xii] / sqrt(n - j));
        rv += log1m_exp(tmp);
        log_sos = log_sum_exp(log_sos, tmp + log1m_exp(log_sos));
        xii += 1;
      }
      rv += (n - i + 2 * eta - 2) * .5 * log1m_exp(log_sos);
    }
    return rv;
  }
  real log_square_tanh(real x) {
    return 2 * log_abs_tanh(x);
  }
  real log_abs_tanh(real x) {
    real z = -2 * abs(x);
    return log1m_exp(z) - log1p_exp(z);
  }
}

data {
  int n;
  real obs;
}

parameters {
  vector[(n * (n - 1)) %/% 2] lkj;
}

model {
  lkj ~ fused_lkj_corr_cholesky(1.0, n);
}
```
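As a quick numerical sanity check of the identity behind `log_abs_tanh` above, namely tanh(|x|) = (1 - e^z) / (1 + e^z) with z = -2|x|, here is a Python sketch mirroring the two Stan helpers:

```python
import math

def log_abs_tanh(x):
    # log|tanh(x)| computed as log1m_exp(z) - log1p_exp(z) with z = -2|x|,
    # mirroring the Stan helper; stable for small |x|.
    z = -2.0 * abs(x)
    return math.log1p(-math.exp(z)) - math.log1p(math.exp(z))

def log_square_tanh(x):
    # log(tanh(x)^2) = 2 * log|tanh(x)|
    return 2.0 * log_abs_tanh(x)

for x in (0.1, 1.0, 5.0):
    print(x, log_abs_tanh(x), math.log(abs(math.tanh(x))))
```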

nsiccha avatar Jul 02 '25 09:07 nsiccha

Hi Nico,

Yes, this is observed behavior; we have a few Discourse posts about it but no math issue until now! Thanks for opening one and for the suggestion. I have 3-4 other parameterizations that don't show this issue, and I'd like to do a bake-off to see which one is generally preferable. Here are a couple of them; I have two others that I still need to clean up. In addition, I'd like to add positive-only correlation matrices, for which I have a parameterization that I presented at StanCon.

```stan
matrix cholesky_corr_constrain_tfp_lp(vector y, int K) {
  matrix[K, K] L = identity_matrix(K);
  int counter = 1;
  real s;

  for (i in 2:K) {
    L[i, 1:i - 1] = y[counter:counter + i - 2]';
    counter += i - 1;
    s = norm2(L[i, :i]);
    L[i, :i] = L[i, :i] / s;
    target += -(i + 1) * log(s);
  }
  return L;
}

matrix cholesky_corr_constrain_proposal_lp(vector y, int K) {
  matrix[K, K] L = identity_matrix(K);
  int counter = 1;

  for (i in 2:K) {
    row_vector[i - 1] y_star = y[counter:counter + i - 2]';
    real dsy = dot_self(y_star);
    real alpha_r = 1 / (dsy + 1);
    real gamma = sqrt(dsy + 2) * alpha_r;
    L[i, :i] = append_col(gamma * y_star, alpha_r);
    target += 0.5 * (i - 2) * log(dsy + 2) - i * log1p(dsy);
    counter += i - 1;
  }
  return L;
}
```
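As a sanity check, here is a Python port of the first (TFP-style) parameterization above, minus the `target +=` Jacobian term (the function name and test setup are mine): it does produce a valid correlation Cholesky factor.

```python
import numpy as np

def cholesky_corr_constrain_tfp(y, K):
    # Port of cholesky_corr_constrain_tfp_lp, without the log-Jacobian:
    # fill row i with unconstrained values, then normalize the row
    # (including the unit diagonal seed) to norm 1.
    L = np.eye(K)
    c = 0
    for i in range(1, K):
        L[i, :i] = y[c:c + i]
        c += i
        L[i, :i + 1] /= np.linalg.norm(L[i, :i + 1])
    return L

rng = np.random.default_rng(1)
K = 8
L = cholesky_corr_constrain_tfp(rng.normal(size=K * (K - 1) // 2), K)
R = L @ L.T  # valid correlation matrix: unit diagonal, L lower triangular
```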

spinkney avatar Jul 07 '25 15:07 spinkney

The diagonal should be distributed something like

$$ L^2_{ii} \sim \text{Beta}(\alpha = \frac{K − i}{2} + \eta, \beta =\frac{i − 1}{2}) $$

I think there are a few ways to derive this; Seth Axen worked it out using directional statistics.
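For η = 1 the LKJ distribution is uniform over correlation matrices, so the claimed diagonal law can be checked by brute force: rejection-sample uniform correlation matrices and compare the mean of L²ᵢᵢ against the Beta((K − i)/2 + η, (i − 1)/2) mean. A Python Monte Carlo sketch (a check, not a derivation):

```python
import numpy as np

def sample_uniform_corr(rng, K):
    # Rejection sampling: iid U(-1, 1) off-diagonals, keep if positive
    # definite. Uniform on the cube restricted to the PD set is exactly
    # LKJ(eta = 1), i.e. uniform over correlation matrices.
    iu = np.triu_indices(K, 1)
    while True:
        R = np.eye(K)
        R[iu] = rng.uniform(-1.0, 1.0, size=len(iu[0]))
        R = R + R.T - np.eye(K)
        if np.all(np.linalg.eigvalsh(R) > 0):
            return R

rng = np.random.default_rng(0)
K = 4  # eta = 1 implicitly, via the uniform rejection sampler
diag2 = np.array([np.diag(np.linalg.cholesky(sample_uniform_corr(rng, K))) ** 2
                  for _ in range(4000)])

# Claimed: L[i, i]^2 ~ Beta(a, b) with a = (K - i)/2 + eta, b = (i - 1)/2,
# so the means for i = 2, 3, 4 should be a / (a + b) = 0.8, 0.6, 0.4.
print(diag2[:, 1:].mean(axis=0))
```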

spinkney avatar Jul 07 '25 15:07 spinkney

> The diagonal should be distributed something like [...]

Cool cool! Given how well things worked with the sqrt scaling, I suspected there must be a (potentially complicated or straightforward) way to derive the marginal distributions 👍

nsiccha avatar Jul 07 '25 15:07 nsiccha