
Instability/Assertion error when using multi-round SNPE with mdn density estimator

Open andererka opened this issue 2 years ago • 13 comments

When using multi-round SNPE for a simple toy example in which I try to infer the mean of several Gaussians, I encountered an assertion error: AssertionError: NaN/Inf present in proposal posterior eval. The error occurs when using mdn as the density estimator, regardless of the sbi version. With nsf or maf, the error does not occur. To check whether this is a matter of numerical instability, I increased the epsilon value here: https://github.com/mackelab/pyknos/blob/5ea3d5b81ecc0f72110d3d03d4a6148c919f3c7c/pyknos/mdn/mdn.py#L83-L85 which didn't change anything.

Increasing the standard deviation of the Gaussians seems to reduce the probability of the assertion error occurring, but I am not sure whether there is a relation.

I attached a notebook for reproducibility of the issue: reproducible_example.zip
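
For reference, the setup is roughly of the following form (a minimal sketch only; the dimensionality, prior bounds, noise scale, and simulation budget are illustrative, not those of the attached notebook):

import torch
from sbi.utils import BoxUniform
from sbi.inference import SNPE, simulate_for_sbi

# Toy simulator: Gaussian observations whose mean is the parameter theta.
def simulator(theta):
    return theta + 0.5 * torch.randn(theta.shape)  # 0.5 is an illustrative std

dim = 3
prior = BoxUniform(-2 * torch.ones(dim), 2 * torch.ones(dim))
x_o = torch.zeros((1, dim))

inference = SNPE(prior, density_estimator="mdn")
proposal = prior
for _ in range(3):  # rounds > 1 trigger the proposal-posterior evaluation that fails
    theta, x = simulate_for_sbi(simulator, proposal, 500)
    _ = inference.append_simulations(theta, x, proposal=proposal).train()
    proposal = inference.build_posterior().set_default_x(x_o)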

andererka avatar Mar 23 '22 10:03 andererka

Thanks for reporting this!

I had a look at your example. It seems the logits of the MOG proposal,

https://github.com/mackelab/sbi/blob/bb6150e54a0ba2e7c15432d52f53c130ced2a63c/sbi/inference/snpe/snpe_c.py#L555-L620

take inf values for some parameters sometimes. We will have to look at this in detail, and this will probably only happen next week, sorry!
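
To illustrate why this is a problem (generic torch code, not a snippet from sbi): once a single mixture logit is inf, normalizing the mixture weights produces NaN, which then trips the finiteness assertion.

import torch

# inf - inf = nan when normalizing the component weights via logsumexp.
logits = torch.tensor([0.3, float("inf"), -1.2])
log_weights = logits - torch.logsumexp(logits, dim=0)
print(log_weights)  # tensor([-inf, nan, -inf])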

janfb avatar Mar 29 '22 15:03 janfb

This also happens to me, but I think not only when using MDNs - if I'm not mistaken the default NDE for SNPE is MAF, and it happened there as well. I should also note that this happens occasionally, not every time.

MaverickMeerkat avatar Apr 04 '22 12:04 MaverickMeerkat

So I debugged this now, and I think, at least in my case, the problem is in my code and not in sbi.

I have a procedure that subsamples thetas from the prior/proposal, though it might jitter them a bit. This procedure can sometimes produce a point just outside the bounds of the prior, i.e., if the prior is U[-1, 1] it can give a point at -1.01. Such a point in turn gets a prior probability of 0, and hence a log probability of -inf.
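
For illustration (generic torch code, not my actual pipeline): a point just outside a uniform prior's support gets log probability -inf, which then propagates into the loss.

import torch
from torch.distributions import Independent, Uniform

# Prior U[-1, 1]; validate_args=False so out-of-support points return -inf
# instead of raising an error.
prior = Independent(Uniform(-torch.ones(1), torch.ones(1), validate_args=False), 1)

theta_inside = torch.tensor([[0.5]])
theta_jittered = torch.tensor([[-1.01]])  # e.g. produced by jitter at the boundary

print(prior.log_prob(theta_inside))    # finite
print(prior.log_prob(theta_jittered))  # -inf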

I guess this does not show up in the non-sequential algorithm because of this fork in the road in _loss of snpe_base (see the attached screenshot).

I guess the nn log probability doesn't assert that all is finite.

MaverickMeerkat avatar Apr 10 '22 13:04 MaverickMeerkat

Ok, I may have spoken too soon. In other simulations I still get this error even though I make sure that the thetas are constrained to the prior... This time the error does come from _log_prob_proposal_posterior_mog, so maybe it is a problem only when using MDNs.

MaverickMeerkat avatar Apr 10 '22 14:04 MaverickMeerkat

Yes, I have the same problem when doing multi-round SNPE. Even if I guarantee that the proposal theta is within the prior, I still get AssertionError: NaN/Inf present in prior eval.

lijingwang avatar Jun 08 '22 21:06 lijingwang

I have also run into this problem since upgrading to version 0.18, for both the MDN and NSF density estimators. With NSF it happens much more rarely.

rbelousov avatar Jul 21 '22 06:07 rbelousov

Thanks for the additional info that it is not restricted to MDNs, but happens with NSF as well. We will have a look before the next release (which we are planning for the end of July)!

janfb avatar Jul 21 '22 06:07 janfb

I think the fact that this happens with MDNs and NSFs might be separate issues. Thus, if @rbelousov has a reproducible example with NSF I would be very grateful.

michaeldeistler avatar Jul 21 '22 07:07 michaeldeistler

I ran a couple of experiments using a linear Gaussian simulator with increasing dimensions and increasing width of the prior. In this setting the problem occurred only for mdn, during evaluation of the MoG proposal posterior. Empirically, it happens because there is not enough training data to sample the prior densely enough. As a consequence, the estimates of the MoG proposal and the posterior precision become unstable and produce inf or NaN values:

  • when inferring a 15-D Gaussian with a broad uniform prior U(0, 100) using only 1000 simulations per round, the error occurs
  • reducing to 10-D with otherwise identical settings works just fine
  • likewise when reducing the prior width to, say, U(0, 60)

I have not identified what exactly happens where, so intermediate fixes would be to:

  • use more training data ;)
  • use maf or nsf

@michaeldeistler I noticed that this version of evaluating the MoG (compared to pyknos) does not use the Cholesky decomposition to calculate the determinant, and does not offer the option to add epsilon to the diagonal to avoid singular matrices. This could be a starting point for fixing it.
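
Just as a sketch of what I mean (not sbi's or pyknos' actual code; the function name and epsilon value are made up): compute the log-determinant from a Cholesky factor and add a small epsilon jitter to the diagonal so that near-singular covariances stay decomposable.

import torch

def stable_logdet(cov: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Epsilon on the diagonal keeps the matrix positive definite; the
    # log-determinant then follows from log|cov| = 2 * sum(log(diag(L))).
    jitter = eps * torch.eye(cov.shape[-1], dtype=cov.dtype)
    L = torch.linalg.cholesky(cov + jitter)
    return 2 * torch.log(torch.diagonal(L, dim1=-2, dim2=-1)).sum(-1)

# Rank-deficient covariance: its determinant is 0, so the plain log-determinant
# is -inf, while the jittered version stays finite.
cov = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
print(stable_logdet(cov))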

janfb avatar Aug 05 '22 09:08 janfb

@janfb, I wonder what the difference is from the previous version of the package. In my case mdn was running with 10**5 samples per round. After upgrading to 0.18 the above issue appeared. Could there be any change in the metaparameters?

rbelousov avatar Aug 05 '22 11:08 rbelousov

I see, interesting. Before that, you were using 0.17.x?

janfb avatar Aug 05 '22 11:08 janfb

I don't remember precisely anymore; I think it was 0.17.2 or 0.17.1.

rbelousov avatar Aug 05 '22 12:08 rbelousov

We did make a fix to APT with MDNs in v0.18.0. I'll have to search for it later (see the changelog).

michaeldeistler avatar Aug 05 '22 12:08 michaeldeistler

Leaving this here for later reference, a more minimal breaking example:

import torch
from sbi.utils import BoxUniform
from sbi.inference import SNPE, simulate_for_sbi

def simulator(theta):
    return theta + torch.randn(theta.shape)

_ = torch.manual_seed(0)

dim = 15
x_o = torch.zeros((1, dim))

prior = BoxUniform(-3*torch.ones(dim), 3*torch.ones(dim))

proposal = prior
inference = SNPE(prior, density_estimator="mdn")

for i in range(10):
    theta, x = simulate_for_sbi(simulator, proposal, 200)
    _ = inference.append_simulations(theta, x, proposal=proposal).train()
    posterior = inference.build_posterior().set_default_x(x_o)
    proposal = posterior

michaeldeistler avatar Aug 23 '22 14:08 michaeldeistler

I believe I have found the bug, see this PR in the pyknos repo. Merging that PR will close this issue, but feel free to reopen this issue if the error persists on your problem.

michaeldeistler avatar Aug 23 '22 15:08 michaeldeistler

Is there hope to see the fix soon in sbi 0.19.2 on PyPI?

rbelousov avatar Aug 23 '22 17:08 rbelousov

We just made a new release with the fix. To use it, you will have to also update pyknos:

pip install pyknos --upgrade
pip install sbi --upgrade

Please let us know if you still have issues.

michaeldeistler avatar Aug 30 '22 09:08 michaeldeistler

@michaeldeistler, the issue still persists. Perhaps I should have stressed earlier the important detail that the infinities appear in the second round of the density estimation. The first round usually runs fine. As @janfb suggested, in some cases increasing the training set solves the problem, but not always.

rbelousov avatar Sep 07 '22 06:09 rbelousov

Hi @rbelousov, yes it is only during the second round.

Would it be possible for you to post a reproducible example? And can you ensure that this code runs for you:

import torch
from sbi.utils import BoxUniform
from sbi.inference import SNPE, simulate_for_sbi

def simulator(theta):
    return theta + torch.randn(theta.shape)

_ = torch.manual_seed(0)

dim = 15
x_o = torch.zeros((1, dim))

prior = BoxUniform(-3*torch.ones(dim), 3*torch.ones(dim))

proposal = prior
inference = SNPE(prior, density_estimator="mdn")

for i in range(10):
    theta, x = simulate_for_sbi(simulator, proposal, 200)
    _ = inference.append_simulations(theta, x, proposal=proposal).train()
    posterior = inference.build_posterior().set_default_x(x_o)
    proposal = posterior

michaeldeistler avatar Sep 07 '22 07:09 michaeldeistler

@michaeldeistler , I will forward you my materials by your office email.

rbelousov avatar Sep 08 '22 08:09 rbelousov

Hi!

Thanks! In your email, you say that the code I posted above does not run for you. This strongly hints towards an issue in your setup.

Can you please send the output of:

import sbi
import torch
import numpy
from pyknos.version import __version__ as pyknos_version

print(sbi.__version__)
print(pyknos_version)
print(torch.__version__)
print(numpy.__version__)

In addition, I would recommend starting an entirely new conda environment and installing everything from scratch. The code above should run smoothly (it runs on my machine).

@janfb could you test whether the above code runs on your machine after upgrading pyknos and sbi?

michaeldeistler avatar Sep 08 '22 08:09 michaeldeistler

@michaeldeistler I have forwarded you the output with the above version check included by email. I tested the code on two machines (personal and cluster) by installing a fresh conda environment from scratch. The outcome is the same. I wonder if the problem is with the upstream PyPI repository, could that be?

rbelousov avatar Sep 08 '22 08:09 rbelousov

I'll check it

michaeldeistler avatar Sep 08 '22 08:09 michaeldeistler

Okay, the pyknos version on PyPI is indeed buggy.

michaeldeistler avatar Sep 08 '22 08:09 michaeldeistler

Alright, could you upgrade to pyknos 0.15.1 and try again?

Thanks a lot!

michaeldeistler avatar Sep 08 '22 09:09 michaeldeistler

Yes, it works now! Thank you!

rbelousov avatar Sep 08 '22 09:09 rbelousov

Awesome, thanks a lot for your time!

michaeldeistler avatar Sep 08 '22 09:09 michaeldeistler

I'm not sure if it's related, but these problems could be due to a problem with PyTorch:

https://github.com/mackelab/sbi/blob/e3a041a935e7e6ae2194bcb1e8b0c4d5fb4f74ef/sbi/utils/sbiutils.py#L39-L40

I encountered crashes under similar conditions, which happen when x contains only the same value in one column. In my case this was due to a combination of binomial samples with high probability and low sample size.

In my particular case, the standard deviation goes to zero, so zx goes to infinity (NaN). I suspect that in the call to the next function the requested memory is way too high, so the program immediately terminates without giving any error indication.
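
For illustration (a generic sketch, not the sbi standardization code itself): when one column of x is constant, its standard deviation is 0 and the z-scored values become NaN.

import torch

# A batch where the first column takes only a single value.
x = torch.tensor([[1.0, 3.0], [1.0, 5.0], [1.0, 7.0]])
zx = (x - x.mean(dim=0)) / x.std(dim=0)  # std of the first column is 0 -> 0/0 = nan
print(zx)  # the first column is nan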

This is a minimal working example to crash torch. Note that I did not have time to upgrade to the latest releases and still use pytorch 1.10 and older versions of sbi, but I thought perhaps it's worth mentioning. Also, in the pytorch issues there are some mentions of handling NaN data:

torch.unique(torch.tensor([[torch.nan]]), dim=0)

flo-schu avatar Mar 08 '23 20:03 flo-schu