
Rare issue with ValueError and AssertionError

Open timothygebhard opened this issue 1 year ago • 3 comments

  • UltraNest version: 3.5.7
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04.2 LTS

Description

I am using UltraNest to run a maximum likelihood estimation on a somewhat complicated likelihood function (essentially the MSE of a small trained neural net on some real data; hence it is tricky to provide a full minimal example). In 99% of cases, this works very well, but occasionally I get some combination of the following errors / warnings:

/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/netiter.py:760: RuntimeWarning: divide by zero encountered in divide
  logleft = log1p(-exp(-1. / nlive))
/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/netiter.py:761: RuntimeWarning: divide by zero encountered in divide
  logright = -1. / nlive
Traceback (most recent call last):
  File "evaluate-with-nested-sampling.py", line 195, in find_optimal_z_with_nested_sampling
    result = sampler.run(
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/integrator.py", line 2311, in run
    for result in self.run_iter(
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/integrator.py", line 2682, in run_iter
    self._update_results(main_iterator, saved_logl, saved_nodeids)
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/integrator.py", line 2786, in _update_results
    sequence, results2 = logz_sequence(self.root, self.pointpile, random=True, check_insertion_order=True)
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/netiter.py", line 1066, in logz_sequence
    main_iterator.passing_node(rootid, node, active_rootids, active_values)
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/netiter.py", line 754, in passing_node
    randompoint = np.random.beta(1, nlive, size=self.ncounters)
  File "mtrand.pyx", line 481, in numpy.random.mtrand.RandomState.beta
  File "_common.pyx", line 600, in numpy.random._common.cont
  File "_common.pyx", line 505, in numpy.random._common.cont_broadcast_2
  File "_common.pyx", line 389, in numpy.random._common.check_array_constraint
ValueError: b <= 0
Traceback (most recent call last):
  File "evaluate-with-nested-sampling.py", line 193, in find_optimal_z_with_nested_sampling
    result = sampler.run(
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/integrator.py", line 2287, in run
    for result in self.run_iter(
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/integrator.py", line 2503, in run_iter
    region_fresh = self._update_region(
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/integrator.py", line 1937, in _update_region
    _update_region_bootstrap(self.region, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/lustre/home/tgebhard/.virtualenvs/ml4ptp/lib/python3.8/site-packages/ultranest/integrator.py", line 369, in _update_region_bootstrap
    r, f = region.compute_enlargement(
  File "ultranest/mlfriends.pyx", line 855, in ultranest.mlfriends.MLFriends.compute_enlargement
  File "ultranest/mlfriends.pyx", line 367, in ultranest.mlfriends.bounding_ellipsoid
AssertionError: (array(nan), array([[0.61218843]])

The first two seem to boil down to nlive becoming zero at some point?
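
As a quick sanity check of that guess, here is a minimal sketch that evaluates the two expressions from the traceback with an assumed `nlive` of zero. It does not touch UltraNest; `nlive = np.array([0])` is only my assumption about the sampler state at the time of the failure.

```python
import warnings

import numpy as np

# Assumed state: the sampler has no live points left.
nlive = np.array([0])

# Same expression as netiter.py line 761 in the traceback above.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    logright = -1.0 / nlive

divide_warnings = [w for w in caught if issubclass(w.category, RuntimeWarning)]

# Same call as netiter.py line 754 in the traceback above.
try:
    np.random.beta(1, nlive, size=1)
    beta_error = None
except ValueError as err:
    beta_error = str(err)

print(len(divide_warnings), beta_error)
```

Both the divide-by-zero warning and the `ValueError` show up with `nlive == 0`, which is consistent with the first traceback. The `AssertionError` in `bounding_ellipsoid` is not covered by this sketch.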

Interestingly, in some cases, I can "fix" the problem by restarting with a different random seed, but "keep retrying until it works" does not strike me as a very principled solution.

Edit: It seems that increasing the number of live points can also help to mitigate the problem.

What I Did

My setup is relatively simple:

np.random.seed(42)

sampler = ultranest.ReactiveNestedSampler(
    param_names=[f'z{i}' for i in range(latent_size)],
    loglike=likelihood,
    transform=prior,
    vectorized=True,
)

result = sampler.run(
    min_num_live_points=400,
    show_status=False,
    viz_callback=False,
    max_ncalls=500_000,
)

where I use the following prior and likelihood (the latter is slightly simplified):

def prior(cube: np.ndarray) -> np.ndarray:

    params = cube.copy()
    for i in range(latent_size):
        params[:, i] = 8 * (params[:, i] - 0.5)

    return params


def likelihood(params: np.ndarray) -> np.ndarray:

    # `output_true` is defined outside the function
    output_pred = my_neural_network(params)
    mse = np.mean((output_true - output_pred) ** 2, axis=1)

    if np.isnan(mse).any():
        return -1e300 * np.ones_like(mse)
    else:
        return -mse

I know that this may all be a bit vague, but any ideas / suggestions for how to debug this further would be greatly appreciated! 🙂

timothygebhard • Dec 09 '22 12:12

The first two seem to boil down to nlive becoming zero at some point?

Yes.

Strange, I am not sure how that can happen!

What's the dimensionality of your inference problem?

Maybe the debug.log is insightful if you store to a log_dir?

show_status=True may also show you how the number of live points changes

One issue is that if the likelihood values are exactly the same, then correct nested sampling (a relatively new development) needs to remove these without replacement. So if you have likelihood plateaus, you can run out of live points.

The latest UltraNest version (3.5.7, which you apparently already use) tries to circumvent this at the beginning of the run by adding points until there are min_num_live_points unique ones. Later in the run, however, this does not work.

Maybe you can print out the likelihoods returned during the run to see if that helps debugging.
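
One way to do that, as a rough sketch: wrap the vectorized likelihood so that every batch is checked for exact ties before it is handed back to the sampler. The helper names `count_plateaus` and `with_plateau_report` are made up for illustration and are not part of UltraNest.

```python
import numpy as np


def count_plateaus(loglike_values):
    """Return {value: count} for log-likelihood values that occur more than once."""
    values, counts = np.unique(
        np.asarray(loglike_values, dtype=float), return_counts=True
    )
    return {float(v): int(c) for v, c in zip(values, counts) if c > 1}


def with_plateau_report(loglike):
    """Wrap a vectorized log-likelihood and print exact ties found in each batch."""

    def wrapped(params):
        values = loglike(params)
        ties = count_plateaus(values)
        if ties:
            print(f"exact ties in batch of {len(values)}: {ties}")
        return values

    return wrapped


# Stand-in likelihood that always returns the same value (a pure plateau).
def flat_loglike(params):
    return np.zeros(len(params))


checked = with_plateau_report(flat_loglike)
batch_values = checked(np.ones((3, 1)))
```

You would then pass `with_plateau_report(likelihood)` as `loglike`. Note that this only sees ties within a single batch; identical values that arrive in different batches are not detected.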

If the NN returns exactly the same value, you may have to add some tie-breaker.
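
A minimal sketch of what such a tie-breaker could look like for the NaN fallback, under my own assumptions. The likelihood posted above returns `-1e300` for the whole batch as soon as one entry is NaN, which by itself produces identical values. The hypothetical helper `break_ties` below replaces only the non-finite entries and makes the fallback depend on the parameters, so that two bad points rarely share the same value. The name, the `bad_value` default, and the quadratic slope are all illustrative choices, not something UltraNest prescribes.

```python
import numpy as np


def break_ties(loglike_values, params, bad_value=-1e100):
    """Replace non-finite log-likelihoods per sample with a parameter-dependent fallback."""
    loglike_values = np.asarray(loglike_values, dtype=float)
    bad = ~np.isfinite(loglike_values)
    # The fallback gets worse with distance from the origin, so it is not flat.
    penalty = bad_value * (1.0 + np.sum(np.asarray(params) ** 2, axis=1))
    return np.where(bad, penalty, loglike_values)


example_params = np.array([[0.5], [1.0], [2.0]])
example_values = np.array([-1.0, np.nan, np.nan])
fixed = break_ties(example_values, example_params)
```

This only addresses ties caused by the fallback value. If the network itself returns exactly the same output for different inputs, the plateau remains and would need a different tie-breaker.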

JohannesBuchner • Dec 09 '22 15:12

What's the dimensionality of your inference problem?

Between 1 and 5, depending on the exact model that I am evaluating. The error messages above were actually taken from a run with a 1D model.

Maybe the debug.log is insightful if you store to a log_dir?

Good point! I am indeed seeing a lot of:

[DEBUG] Plateau detected at L=-1.669579e+05, not replacing live point.

So you were definitely onto something!

However, I also get those for cases where the run succeeds. Could this have something to do with the remainder_fraction? Like, in a few unlucky cases, the sampler runs out of live points before the target fraction is reached?

timothygebhard • Dec 09 '22 17:12

Yes, I suspect so.

JohannesBuchner • Dec 09 '22 18:12