
BUG: LKJCorr breaks when used as covariance with MvNormal

velochy opened this issue on Jan 14 '24 · 10 comments

Describe the issue:

LKJCorr's backward sampling is currently broken: values are not constrained to a positive semidefinite matrix early enough. As a result, sampling randomly fails at the initial point when LKJCorr is used as a covariance matrix for MvNormal and similar multivariate distributions. The failure is inherently random, but it becomes increasingly likely with larger n: it fails almost always for n >= 10 and can in principle occur with n as low as 3.

The error message is informative: it is easy to verify that the matrix reported in the error has negative eigenvalues, proving it is not PSD.

This only affects backward sampling (i.e., logp computations); .eval() and predictive sampling work as they should.
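
As a quick sanity check of the negative-eigenvalue claim (a minimal sketch, assuming the default interval transform maps unconstrained values to (-1, 1) via the logistic function; check_psd is a hypothetical helper):

import numpy as np
from scipy.special import expit

# Map unconstrained values through the interval transform's inverse,
# rebuild the symmetric correlation matrix, and inspect its spectrum
def check_psd(z, n):
    corr_values = 2 * expit(z) - 1  # backward of Interval(-1, 1)
    corr = np.zeros((n, n))
    corr[np.triu_indices(n, k=1)] = corr_values
    corr = corr + corr.T + np.eye(n)
    return np.linalg.eigvalsh(corr).min()  # negative => not PSD

rng = np.random.default_rng(0)
n = 10
print(check_psd(rng.uniform(-1, 1, size=n * (n - 1) // 2), n))  # typically negative for n = 10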

Reproducible code example:

import pymc as pm
import pytensor.tensor as pt
import numpy as np

n, eta = 10, 1

# Turn LKJCorr output to a proper square corr matrix
def corr_to_mat(corr_values, n):
    corr = pt.zeros((n, n))
    corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
    return corr + corr.T + pt.identity_like(corr)

# Generate Data
corr = corr_to_mat(pm.LKJCorr.dist(n=n, eta=eta), n)
y = pm.draw(pm.MvNormal.dist(mu=0, cov=corr), 100)

# PyMC Model
with pm.Model() as m:
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta)
    corr = corr_to_mat(corr_values, n)
    y_hat = pm.MvNormal('y_hat', mu=0, cov=corr, observed=y)
    idata = pm.sample()

Error message:

Traceback (most recent call last):
  File "/home/velochy/pymc/test3.py", line 22, in <module>
    idata = pm.sample()
            ^^^^^^^^^^^
  File "/home/velochy/pymc/pymc/sampling/mcmc.py", line 744, in sample
    model.check_start_vals(ip)
  File "/home/velochy/pymc/pymc/model/core.py", line 1660, in check_start_vals
    raise SamplingError(
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'corr_values_interval__': array([ 0.96546568,  0.55627918, -0.52092415, -0.79551073, -0.04416735,
       -0.61208579, -0.04889871, -0.62181685,  0.13621748,  0.32908589,
       -0.38471896, -0.89088229, -0.1326561 , -0.18828583,  0.26747631,
       -0.47067211,  0.51480876, -0.56323033, -0.3513053 ,  0.30480191,
       -0.36865109,  0.1479414 , -0.66890036,  0.98484518, -0.85993115,
        0.48094929, -0.04547197, -0.07743477, -0.3381161 ,  0.12565052,
       -0.06450642, -0.69698417,  0.80246982, -0.67958155, -0.29175309,
       -0.02445876, -0.7858127 ,  0.17738997, -0.87779485, -0.27196011,
       -0.52231265,  0.46271978,  0.0810375 ,  0.84458683, -0.72285369])}

Logp initial evaluation results:
{'corr_values': -inf, 'y_hat': -inf}
You can call `model.debug()` for more details.

PyMC version information:

pymc 5.10.3

Context for the issue:

This arose in discussion with @ricardoV94 and @jessegrabowski in https://discourse.pymc.io/t/using-lkjcorr-together-with-mvnormal/13606/29. The thread has more useful context.

velochy avatar Jan 14 '24 13:01 velochy

The initial point problem happens because of jittering combined with the insufficient IntervalTransform that is currently used.

I assume model.point_logps() still comes out finite for every variable? Otherwise it would indicate we have a problem with the moment function that is used to provide initial valid values for sampling from a distribution.
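
A quick way to check both things (a minimal sketch; init="adapt_diag" drops the jitter that the default "jitter+adapt_diag" initialization applies, which may avoid the initial-point failure without fixing the underlying transform):

with m:
    # Finite values here mean the moment-based initial point itself is valid
    print(m.point_logps())

    # Sampling without jitter starts from that unjittered initial point
    idata = pm.sample(init="adapt_diag")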

ricardoV94 avatar Jan 14 '24 14:01 ricardoV94

See also discussion in https://github.com/pymc-devs/pymc-experimental/issues/13

ricardoV94 avatar Feb 06 '24 09:02 ricardoV94

@velochy We just merged https://github.com/pymc-devs/pytensor/pull/614, which lets us now use pt.linalg.norm inside PyMC models. We can now fix this bug by defining a new transform for LKJCorr. Basically we just need to copy the tensorflow-probability bijector here, then set the newly defined transform as the default for LKJCorr. Is that something you'd be interested in helping with?
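
For orientation, a rough skeleton of what such a transform could look like (a sketch only, not the actual implementation: CorrelationCholeskyTransform is a hypothetical name, the Transform base-class import path varies across PyMC versions, and forward/log_jac_det are left as stubs pointing at the TFP reference):

import numpy as np
import pytensor.tensor as pt
from pymc.distributions.transforms import Transform  # path may differ by version

class CorrelationCholeskyTransform(Transform):
    # Sketch of a port of tfp.bijectors.CorrelationCholesky, adapted to
    # LKJCorr's flat vector of upper-triangular correlations
    name = "correlation_cholesky"

    def __init__(self, n):
        self.n = n
        self.tril_idx = np.tril_indices(n, k=-1)
        self.triu_idx = np.triu_indices(n, k=1)

    def backward(self, value, *inputs):
        # Unconstrained vector -> unit-diagonal lower-triangular matrix
        W = pt.zeros((self.n, self.n))
        W = pt.set_subtensor(W[self.tril_idx], value)
        W = W + pt.identity_like(W)
        # Row normalization makes L a valid Cholesky factor of a correlation
        # matrix, so L @ L.T is PSD by construction; this is where
        # pt.linalg.norm comes in (equivalently:
        # pt.sqrt((W ** 2).sum(axis=-1, keepdims=True)))
        L = W / pt.linalg.norm(W, axis=-1, keepdims=True)
        C = L @ L.T
        return C[self.triu_idx]  # back to LKJCorr's flat value space

    def forward(self, value, *inputs):
        # Inverse: rebuild C, take its Cholesky factor, and undo the row
        # normalization; see the TFP reference implementation
        raise NotImplementedError

    def log_jac_det(self, value, *inputs):
        # log|det J| of backward; see tfp.bijectors.CorrelationCholesky
        raise NotImplementedError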

jessegrabowski avatar May 03 '24 06:05 jessegrabowski

Depends on how urgent it is :) I'm currently pretty swamped, but I should be able to get to it within a month or so.

velochy avatar May 03 '24 08:05 velochy

Hi - I'm at PyData London and will take a look at this.

johncant avatar Jun 15 '24 13:06 johncant

For anyone at the PyMC hackathon, I'm in the back middle of the room.

johncant avatar Jun 15 '24 13:06 johncant

Hi @johncant, let me know if anything isn't clear about how to proceed.

jessegrabowski avatar Jun 15 '24 14:06 jessegrabowski

@jessegrabowski thanks a lot! Nothing is clear and I am out of my depth. However, it's a great learning experience and I still intend to fix this issue.

@twiecki and @fonnesbeck were there and helped me understand more about pymc than I would have learned from days of hacking on it myself.

Here's what we found:

  • The issue can be replicated with an even smaller example [1]
  • LKJCholeskyCov does not have this problem [2]
  • LKJCholeskyCov did appear to have this problem when I passed sd_dist=pm.Uniform.dist(1.8, 2.2) to it. Chris indicated that this might cause convergence problems, and we changed it to pm.HalfNormal.dist(5, shape=10)
  • The return type section is missing from the LKJCorr documentation
  • There is a typo in the code where Cholesky is spelled "Choleksy"
  • The LKJ paper is paywalled

I think it's obvious to you that the problem is the default transform for LKJCorr and that it needs to be replaced with this. For me to do that, the next steps are:

  • Figure out how to draw a single sample directly from Jupyter that replicates this problem (see the sketch after this list)
  • Figure out how to get the transform from an LKJCorr object (also in the sketch below)
  • Understand the tensorflow bijector, why it should work, and why the current impl does not work
  • Read the LKJ paper
  • Port the tensorflow bijector (the easy part!)
  • Test
  • Update docs
  • Submit PR
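
For the first two steps, a possible starting point (a minimal sketch; Model.rvs_to_transforms and Model.compile_logp are the public APIs I believe apply here, and the transformed variable name is taken from the error message above):

import numpy as np
import pymc as pm

n, eta = 10, 1

with pm.Model() as m:
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta)

# The transform applied to the free RV (the suspect interval transform)
print(m.rvs_to_transforms[corr_values])

# Evaluate the model logp at a single jitter-like point in unconstrained
# space; -inf would replicate the sampling failure
logp_fn = m.compile_logp()
point = {'corr_values_interval__': np.random.uniform(-1, 1, size=n * (n - 1) // 2)}
print(logp_fn(point))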

[1] Replicates the issue as long as n is high enough

import pymc as pm

n, eta = 10, 3

# PyMC Model
with pm.Model() as m:
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta, return_matrix=True)
    idata = pm.sample()

[2] Replacement with LKJCholeskyCov - it samples without error (but with divergences)

import pymc as pm
import pytensor.tensor as pt
import numpy as np

n, eta = 10, 1

# Turn LKJCorr output to a proper square corr matrix
def corr_to_mat(corr_values, n):
    corr = pt.zeros((n, n))
    corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
    return corr + corr.T + pt.identity_like(corr)

# Generate Data
corr = corr_to_mat(pm.LKJCorr.dist(n=n, eta=eta), n)
y = pm.draw(pm.MvNormal.dist(mu=0, cov=corr), 100)

# PyMC Model
with pm.Model() as m:
    # LKJCholeskyCov returns a (chol, corr, stds) tuple when compute_corr=True
    # (the default), so index 0 is the Cholesky factor
    cov_values = pm.LKJCholeskyCov('cov_values', n=n, eta=eta, sd_dist=pm.HalfNormal.dist(5, shape=n))
    y_hat = pm.MvNormal('y_hat', mu=0, chol=cov_values[0], observed=y)
    idata = pm.sample()
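
For readability, the return tuple can also be unpacked explicitly inside the model block (compute_corr=True is the default):

chol, corr, stds = pm.LKJCholeskyCov(
    'cov_values', n=n, eta=eta, sd_dist=pm.HalfNormal.dist(5, shape=n)
)
y_hat = pm.MvNormal('y_hat', mu=0, chol=chol, observed=y)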

johncant avatar Jun 16 '24 22:06 johncant

Progress tracking comment

  • [x] Figure out how to draw a single sample directly from Jupyter that replicates this problem
  • [x] Figure out how to get the transform from an LKJCorr object
  • [x] Understand the tensorflow bijector, why it should work, and why the current impl does not work
  • [ ] Read the LKJ paper
  • [x] Port the tensorflow bijector (the easy part!)
  • [ ] Test
  • [ ] Update docs
  • [ ] Submit PR

johncant avatar Jun 18 '24 11:06 johncant

I decided to post my progress from yesterday as a Jupyter notebook: https://colab.research.google.com/drive/1UNkgxEuy6j_Eww0aMFnsfzQfQmlxG2-d?usp=sharing. Basically, I now also understand why the transform on LKJCorr is the problem, and I can start understanding and porting the TF transform.

johncant avatar Jun 18 '24 11:06 johncant