BUG: LKJCorr breaks when used as covariance with MvNormal
Describe the issue:
LKJCorr backward sampling is currently broken: it does not constrain values to a positive semidefinite matrix early enough. This leads to model evaluation randomly failing when the result is provided as a covariance matrix for MvNormal and other similar multivariate distributions. The failure is inherently random, but it becomes increasingly likely with larger n: it can in theory be seen with n as low as 3, and it occurs almost always with n >= 10.
The error message is informative: it is easy to verify that the matrix it reports has negative eigenvalues, proving it is not PSD.
This only affects backward sampling (i.e. logp computations); `.eval()` and predictive sampling work as they should.
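To see why an elementwise interval constraint is not enough, here is a minimal sketch (mine, not from the original report) contrasting values drawn uniformly in (-1, 1) with a genuine LKJCorr forward draw for n=10:

```python
import numpy as np
import pymc as pm

n = 10
idx = np.triu_indices(n, k=1)

def to_corr(vals):
    # Symmetric matrix with unit diagonal from upper-triangle values
    corr = np.zeros((n, n))
    corr[idx] = vals
    return corr + corr.T + np.eye(n)

rng = np.random.default_rng(0)
# Elementwise values in (-1, 1): almost never PSD for n = 10
print(np.linalg.eigvalsh(to_corr(rng.uniform(-1, 1, size=len(idx[0])))).min())
# A genuine LKJCorr forward draw: PSD (min eigenvalue >= 0 up to rounding)
print(np.linalg.eigvalsh(to_corr(pm.draw(pm.LKJCorr.dist(n=n, eta=1)))).min())
```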
Reproducible code example:
```python
import pymc as pm
import pytensor.tensor as pt
import numpy as np

n, eta = 10, 1

# Turn LKJCorr output into a proper square correlation matrix
def corr_to_mat(corr_values, n):
    corr = pt.zeros((n, n))
    corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
    return corr + corr.T + pt.identity_like(corr)

# Generate data
corr = corr_to_mat(pm.LKJCorr.dist(n=n, eta=eta), n)
y = pm.draw(pm.MvNormal.dist(mu=0, cov=corr), 100)

# PyMC model
with pm.Model() as m:
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta)
    corr = corr_to_mat(corr_values, n)
    y_hat = pm.MvNormal('y_hat', mu=0, cov=corr, observed=y)
    idata = pm.sample()
```
Error message:
```
Traceback (most recent call last):
  File "/home/velochy/pymc/test3.py", line 22, in <module>
    idata = pm.sample()
            ^^^^^^^^^^^
  File "/home/velochy/pymc/pymc/sampling/mcmc.py", line 744, in sample
    model.check_start_vals(ip)
  File "/home/velochy/pymc/pymc/model/core.py", line 1660, in check_start_vals
    raise SamplingError(
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'corr_values_interval__': array([ 0.96546568,  0.55627918, -0.52092415, -0.79551073, -0.04416735,
       -0.61208579, -0.04889871, -0.62181685,  0.13621748,  0.32908589,
       -0.38471896, -0.89088229, -0.1326561 , -0.18828583,  0.26747631,
       -0.47067211,  0.51480876, -0.56323033, -0.3513053 ,  0.30480191,
       -0.36865109,  0.1479414 , -0.66890036,  0.98484518, -0.85993115,
        0.48094929, -0.04547197, -0.07743477, -0.3381161 ,  0.12565052,
       -0.06450642, -0.69698417,  0.80246982, -0.67958155, -0.29175309,
       -0.02445876, -0.7858127 ,  0.17738997, -0.87779485, -0.27196011,
       -0.52231265,  0.46271978,  0.0810375 ,  0.84458683, -0.72285369])}

Logp initial evaluation results:
{'corr_values': -inf, 'y_hat': -inf}
You can call `model.debug()` for more details.
```
PyMC version information:
pymc 5.10.3
Context for the issue:
This arose in a discussion with @ricardoV94 and @jessegrabowski in https://discourse.pymc.io/t/using-lkjcorr-together-with-mvnormal/13606/29. The thread has more useful context.
The initial point problem happens because of jittering and the insufficient IntervalTransform that is being used.
I assume `model.point_logps()` still comes out finite for every variable? Otherwise it would indicate we have a problem with the `moment` function that is used to provide valid initial values for sampling from a distribution.
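One quick way to check (a sketch, assuming the model `m` from the reproducer above) is to evaluate the logp at the un-jittered initial point and, separately, to sample with an init that skips jittering:

```python
with m:
    # logp at the default (un-jittered) initial point; expected to be finite
    print(m.point_logps())

    # The default init is "jitter+adapt_diag"; dropping the jitter should
    # avoid the invalid starting point, though the transform is still wrong.
    idata = pm.sample(init="adapt_diag")
```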
See also discussion in https://github.com/pymc-devs/pymc-experimental/issues/13
@velochy We just merged https://github.com/pymc-devs/pytensor/pull/614, which lets us now use `pt.linalg.norm` inside of pymc models. We can now fix this bug by defining a new transformation for LKJCorr. Basically we just need to copy the tensorflow-probability bijector here, then set the newly defined transformation as the default transformation for LKJCorr. Is that something you'd be interested in helping with?
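For context, a new transform would follow PyMC's `Transform` interface. The toy sketch below is my illustration of that API's shape, not the actual fix; it assumes the base class is importable from `pymc.distributions.transforms`, which may vary across PyMC versions:

```python
import pytensor.tensor as pt
from pymc.distributions.transforms import Transform

class TanhInterval(Transform):
    # Toy elementwise transform onto (-1, 1); it illustrates the API only
    # and is still insufficient to guarantee a PSD correlation matrix.
    name = "tanh_interval"

    def backward(self, value, *inputs):
        # unconstrained -> constrained
        return pt.tanh(value)

    def forward(self, value, *inputs):
        # constrained -> unconstrained
        return pt.arctanh(value)

    def log_jac_det(self, value, *inputs):
        # log|d backward / d value| = log(1 - tanh(value)**2)
        return pt.log1p(-pt.tanh(value) ** 2)

# A transform instance can be passed per-variable, e.g.:
# corr_values = pm.LKJCorr("corr_values", n=n, eta=eta, transform=TanhInterval())
```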
Depends on how urgent it is :) I'm currently pretty swamped, but I should be able to get to it within a month or so.
Hi - I'm at PyData London and will take a look at this.
For anyone in the pymc hack, I'm in the back middle of the room
Hi @johncant , let me know if anything isn't clear about how to proceed
@jessegrabowski thanks a lot! Nothing is clear and I am out of my depth. However, it's great learning and I still intend to fix this issue.
@twiecki and @fonnesbeck were there and helped me understand more about pymc than I would have learned from days of hacking on it myself.
Here's what we found:
- Issue can be replicated with an even smaller example [1]
- `LKJCholeskyCov` does not have this problem [2]
- `LKJCholeskyCov` did appear to have this problem when I passed `sd_dist=pm.Uniform.dist(1.8, 2.2)` to `MvNormal`. Chris indicated that this might cause convergence problems and we changed it to `pm.HalfNormal.dist(5, shape=10)`
- Return type section of `LKJCorr` is missing from the documentation
- Typo in the code where Cholesky is spelled Choleksy
- LKJ paper is paywalled
I think it's obvious to you that the problem is the default transformation for LKJCorr and that it needs to be replaced with this one. For me to do this, the next steps are:
- Figure out how to draw a single sample directly from jupyter that replicates this problem
- Figure out how to get the transform from an LKJCorr object (see the sketch after this list)
- Understand the tensorflow bijector, why it should work, and why the current impl does not work
- Read the LKJ paper
- Port the tensorflow bijector (the easy part!)
- Test
- Update docs
- Submit PR
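On the second step: a minimal sketch of how to inspect a variable's transform, assuming PyMC 5's `Model.rvs_to_transforms` mapping:

```python
import pymc as pm

with pm.Model() as m:
    corr_values = pm.LKJCorr("corr_values", n=4, eta=1)

# The model keeps a dict from each free RV to its transform;
# for LKJCorr this is the interval-type transform under suspicion.
print(m.rvs_to_transforms[corr_values])
```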
[1] Replicates the issue as long as `n` is high enough:
```python
import pymc as pm
import pytensor.tensor as pt

n, eta = 10, 3

# PyMC model
with pm.Model() as m:
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta, return_matrix=True)
    idata = pm.sample()
```
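To get a feel for how the failure probability grows with `n`, here is a quick Monte Carlo sketch (my addition, not from the thread) estimating how often an elementwise-constrained draw happens to be PSD:

```python
import numpy as np

rng = np.random.default_rng(42)

def psd_rate(n, trials=1_000):
    # Fraction of uniform (-1, 1) upper-triangle fills that yield a PSD matrix
    idx = np.triu_indices(n, k=1)
    hits = 0
    for _ in range(trials):
        corr = np.zeros((n, n))
        corr[idx] = rng.uniform(-1, 1, size=len(idx[0]))
        corr = corr + corr.T + np.eye(n)
        hits += np.linalg.eigvalsh(corr).min() >= 0
    return hits / trials

for n in (3, 5, 10):
    print(n, psd_rate(n))
```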
[2] Replacement with `LKJCholeskyCov` - it samples without error (but with divergences):
```python
import pymc as pm
import pytensor.tensor as pt
import numpy as np

n, eta = 10, 1

# Turn LKJCorr output into a proper square correlation matrix
def corr_to_mat(corr_values, n):
    corr = pt.zeros((n, n))
    corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
    return corr + corr.T + pt.identity_like(corr)

# Generate data
corr = corr_to_mat(pm.LKJCorr.dist(n=n, eta=eta), n)
y = pm.draw(pm.MvNormal.dist(mu=0, cov=corr), 100)

# PyMC model
with pm.Model() as m:
    # sigma = pm.Uniform('sigma', 1.9, 2.1)
    # corr_values = pm.LKJCorr('corr_values', n=n, eta=eta)
    cov_values = pm.LKJCholeskyCov('cov_values', n=n, eta=eta, sd_dist=pm.HalfNormal.dist(5, shape=10))
    # corr = corr_to_mat(corr_values, n)
    # With compute_corr=True (the default) LKJCholeskyCov returns
    # (chol, corr, stds); the Cholesky factor is the first element.
    y_hat = pm.MvNormal('y_hat', mu=0, chol=cov_values[0], observed=y)
    idata = pm.sample()
```
Progress tracking comment
- [x] Figure out how to draw a single sample directly from jupyter that replicates this problem
- [x] Figure out how to get the transform from an LKJCorr object
- [x] Understand the tensorflow bijector, why it should work, and why the current impl does not work
- [ ] Read the LKJ paper
- [x] Port the tensorflow bijector (the easy part!)
- [ ] Test
- [ ] Update docs
- [ ] Submit PR
I decided to post my progress from yesterday as a jupyter notebook: https://colab.research.google.com/drive/1UNkgxEuy6j_Eww0aMFnsfzQfQmlxG2-d?usp=sharing . Basically, I now also understand why the transform on LKJCorr is the problem and can start understanding and porting the TF transform.