Instability/Assertion error when using multi-round SNPE with mdn density estimator
When using multi-round SNPE for a simple toy example in which I try to infer the means of several Gaussians, I encountered an assertion error: `AssertionError: NaN/Inf present in proposal posterior eval`. The error occurs when using `mdn` as the density estimator, regardless of the sbi version. With `nsf` or `maf`, the error does not occur. To check whether this is a matter of numerical instability, I increased the epsilon value here: https://github.com/mackelab/pyknos/blob/5ea3d5b81ecc0f72110d3d03d4a6148c919f3c7c/pyknos/mdn/mdn.py#L83-L85, which did not change anything.
Increasing the standard deviation of the Gaussians seems to reduce the probability of the assertion error occurring, but I am not sure whether there is a relation.
I attached a notebook for reproducibility of the issue: reproducible_example.zip
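For reference, a rough sketch of the kind of setup described above (the attached notebook is the authoritative version; the simulator noise, prior bounds, and simulation budget here are assumptions):

```python
import torch
from sbi.utils import BoxUniform
from sbi.inference import SNPE, simulate_for_sbi

# Toy problem: infer the means of several independent Gaussians.
def simulator(theta):
    return theta + 0.5 * torch.randn(theta.shape)

dim = 3
prior = BoxUniform(-2 * torch.ones(dim), 2 * torch.ones(dim))
x_o = torch.zeros((1, dim))

inference = SNPE(prior, density_estimator="mdn")
proposal = prior
for _ in range(3):  # multi-round SNPE
    theta, x = simulate_for_sbi(simulator, proposal, 500)
    _ = inference.append_simulations(theta, x, proposal=proposal).train()
    posterior = inference.build_posterior().set_default_x(x_o)
    proposal = posterior
```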
Thanks for reporting this!
I had a look at your example. It seems the logits of the MoG proposal, https://github.com/mackelab/sbi/blob/bb6150e54a0ba2e7c15432d52f53c130ced2a63c/sbi/inference/snpe/snpe_c.py#L555-L620, sometimes take inf values for some parameters. We will have to look at this in detail, and that will probably only happen next week, sorry!
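(For illustration, the failing assertion boils down to a finiteness check of this sort; the tensor below is a made-up stand-in for the MoG proposal logits, not sbi's internal code:)

```python
import torch

# Made-up logits with an inf entry, standing in for the MoG proposal parameters.
logits = torch.tensor([[0.2, float("inf"), -1.3]])
# This check fails, raising "AssertionError: NaN/Inf present in proposal posterior eval".
assert torch.isfinite(logits).all(), "NaN/Inf present in proposal posterior eval"
```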
This also happens to me, but I think not only when using MDNs: if I'm not mistaken, the default NDE for SNPE is MAF, and it happened there as well. I should also note that this happens occasionally, not every time.
So I debugged this now, and I think that, at least in my case, the problem is in my code and not in sbi.
I have a procedure that subsamples thetas from the prior/proposal, though it might jitter them a bit, and this procedure can sometimes yield a point just outside the bounds of the prior. I.e., if the prior is U[-1, 1], it can give a point at -1.01. That point in turn gets a prior probability of 0, i.e. a log probability of -inf (illustrated below).
I guess this does not show up in the non-sequential algorithm because of this fork in the road (in `_loss` of `snpe_base`): I guess the plain nn log probability doesn't assert that everything is finite.
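A small, self-contained illustration of the effect described above (plain torch.distributions, not the actual subsampling code):

```python
import torch
from torch.distributions import Independent, Uniform

# A U[-1, 1] prior; validate_args=False so points outside the support return
# -inf log probability instead of raising a validation error.
prior = Independent(Uniform(-torch.ones(1), torch.ones(1), validate_args=False), 1)

print(prior.log_prob(torch.tensor([[0.5]])))    # finite log-density
print(prior.log_prob(torch.tensor([[-1.01]])))  # -inf: a jittered point just outside the prior
```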
Ok, I may have spoken too soon. In other simulations I still get this error even though I make sure that the thetas are constrained to the prior... This time the error does come from `_log_prob_proposal_posterior_mog`, so maybe it is a problem only when using MDNs.
Yes, I have the same problem when doing multi-round SNPE. Even if I guarantee that the proposal thetas are within the prior, I am still getting `AssertionError: NaN/Inf present in prior eval`.
I have also run into this problem since upgrading to version 0.18, for both the MDN and NSF density estimators. With the NSF it happens much more rarely.
Thanks for the additional info that it is not restricted to MDNs, but happens with NSF as well. We will have a look before the next release (which we are planning for the end of July)!
I think the fact that this happens with MDNs and NSFs might be separate issues. Thus, if @rbelousov has a reproducible example with NSF I would be very grateful.
I ran a couple of experiments using a linear Gaussian simulator with increasing dimensions and increasing width of the prior. In this setting, the problem occurred only for `mdn`, during evaluation of the MoG proposal posterior. Empirically, it happens because there is not enough training data to sample the prior densely enough. As a consequence, the estimates of the MoG proposal and the posterior precision become unstable and produce inf or NaN values:
- When inferring a 15-D Gaussian with a broad uniform prior U(0, 100) using only 1000 simulations per round, it will occur.
- Reducing to 10-D with otherwise identical settings works just fine.
- The same holds when reducing the prior width to, say, U(0, 60).
I have not identified what exactly happens where, so an intermediate fix would be to:
- use more training data ;)
- use `maf` or `nsf` instead (see the sketch below).
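A minimal sketch of switching the density estimator, assuming an otherwise unchanged pipeline (the prior here is just a placeholder matching the broad-uniform setting above):

```python
import torch
from sbi.utils import BoxUniform
from sbi.inference import SNPE

# Placeholder prior in the spirit of the experiments above (10-D, U(0, 100)).
prior = BoxUniform(torch.zeros(10), 100 * torch.ones(10))

# Flow-based density estimators avoid the unstable MoG proposal-posterior evaluation.
inference = SNPE(prior, density_estimator="maf")  # or density_estimator="nsf"
```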
@michaeldeistler I noticed that this version of evaluating the MoG (compared to pyknos) does not use the Cholesky decomposition to calculate the determinant, and does not offer the option to add an epsilon to the diagonal to avoid singular matrices. This could be a starting point for fixing it.
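For illustration, a minimal sketch of that idea (not sbi's or pyknos's actual code): compute the log-determinant via a Cholesky factor with a small epsilon added to the diagonal.

```python
import torch

def jittered_logdet(cov: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Add eps to the diagonal to guard against (near-)singular covariances,
    # then use log det(C) = 2 * sum(log(diag(L))) for the Cholesky factor L.
    jitter = eps * torch.eye(cov.shape[-1], dtype=cov.dtype, device=cov.device)
    chol = torch.linalg.cholesky(cov + jitter)
    return 2.0 * torch.log(torch.diagonal(chol, dim1=-2, dim2=-1)).sum(-1)

# Example: a nearly singular 2x2 covariance matrix.
cov = torch.tensor([[1.0, 1.0], [1.0, 1.0 + 1e-12]])
print(jittered_logdet(cov))
```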
@janfb, I wonder what the difference is with the previous version of the package. In my case, `mdn` was running with 10**5 samples per round. After upgrading to 0.18.0, the above issue appeared. Could there have been a change in metaparameters?
I see, interesting. Before, you were using 0.17.x?
I don't remember precisely; I think it was 0.17.2 or 0.17.1.
We did make a fix to APT with MDNs in v0.18.0. I'll have to search for it later (see the changelog).
Leaving this here for later reference, a more minimal breaking example:

```python
import torch

from sbi.utils import BoxUniform
from sbi.inference import SNPE, simulate_for_sbi


def simulator(theta):
    return theta + torch.randn(theta.shape)


_ = torch.manual_seed(0)

dim = 15
x_o = torch.zeros((1, dim))
prior = BoxUniform(-3 * torch.ones(dim), 3 * torch.ones(dim))
proposal = prior

inference = SNPE(prior, density_estimator="mdn")
for i in range(10):
    theta, x = simulate_for_sbi(simulator, proposal, 200)
    _ = inference.append_simulations(theta, x, proposal=proposal).train()
    posterior = inference.build_posterior().set_default_x(x_o)
    proposal = posterior
```
I believe I have found the bug, see this PR in the pyknos repo. Merging that PR will close this issue, but feel free to reopen this issue if the error persists on your problem.
Is there any hope of seeing the fix soon in sbi 0.19.2 on PyPI?
We just made a new release with the fix. To use it, you will also have to update pyknos:

```bash
pip install pyknos --upgrade
pip install sbi --upgrade
```

Please let us know if you still have issues.
@michaeldeistler, the issue still persists. Perhaps I should have stressed an important detail earlier: the infinities appear in the second round of density estimation. The first round usually runs fine. As @janfb suggested, in some cases increasing the training set solves the problem, but not always.
Hi @rbelousov, yes it is only during the second round.
Would it be possible for you to post a reproducible example? And can you ensure that this code runs for you:
```python
import torch

from sbi.utils import BoxUniform
from sbi.inference import SNPE, simulate_for_sbi


def simulator(theta):
    return theta + torch.randn(theta.shape)


_ = torch.manual_seed(0)

dim = 15
x_o = torch.zeros((1, dim))
prior = BoxUniform(-3 * torch.ones(dim), 3 * torch.ones(dim))
proposal = prior

inference = SNPE(prior, density_estimator="mdn")
for i in range(10):
    theta, x = simulate_for_sbi(simulator, proposal, 200)
    _ = inference.append_simulations(theta, x, proposal=proposal).train()
    posterior = inference.build_posterior().set_default_x(x_o)
    proposal = posterior
```
@michaeldeistler, I will forward you my materials via your office email.
Hi!
Thanks! In your email, you say that the code I posted above does not run for you. This strongly hints towards an issue in your setup.
Can you please send the output of:
```python
import sbi
import torch
import numpy
from pyknos.version import __version__ as pyknos_version

print(sbi.__version__)
print(pyknos_version)
print(torch.__version__)
print(numpy.__version__)
```
In addition, I would recommend starting an entirely new conda environment and installing everything from scratch. The code above should run smoothly (it runs on my machine).
@janfb could you test whether the above code runs on your machine after upgrading pyknos and sbi?
@michaeldeistler I have forwarded you the output, including the above version check, by email. I tested the code on two machines (personal and cluster), installing a fresh conda environment from scratch on each. The outcome is the same. Could the problem lie with the upstream PyPI repository?
I'll check it
Okay, the pyknos version on PyPI is indeed buggy.
Alright, could you upgrade to pyknos 0.15.1 and try again?
Thanks a lot!
Yes, it works now! Thank you!
Awesome, thanks a lot for your time!
I'm not sure if it's related, but these problems could be due to a problem with pytorch:
https://github.com/mackelab/sbi/blob/e3a041a935e7e6ae2194bcb1e8b0c4d5fb4f74ef/sbi/utils/sbiutils.py#L39-L40
I encountered crashes under similar conditions, which happen when x contains only the same value in one column. In my case this was due to a combination of binomial samples with high probability and low sample size.
In my particular case, the standard deviation goes to zero, so zx goes to infinity (NaN). I suspect that in the call to the next function the requested memory is way too high, so the program terminates immediately without giving any error indication.
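A small illustration of the standardization issue described above (plain torch, assuming a simple mean/std z-score as in the linked lines):

```python
import torch

# Two samples of x whose first column is constant (e.g. binomial draws that
# all came out identical due to high probability and low sample size).
x = torch.tensor([[1.0, 2.0],
                  [1.0, 3.0]])

mean, std = x.mean(dim=0), x.std(dim=0)
zx = (x - mean) / std  # std of the constant column is 0 -> 0/0 -> nan
print(zx)
```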
Here is a minimal working example that crashes torch. Note that I have not had time to upgrade to the latest releases and am still on pytorch 1.10 and an older version of sbi, but I thought it might be worth mentioning. Also, in the pytorch issues there are some mentions of handling NaN data:

```python
import torch

torch.unique(torch.tensor([[torch.nan]]), dim=0)
```