
WrappingEllipsoid.compute_enlargement not returning for some MPI processes in parallel

Open lazygun37 opened this issue 1 year ago • 4 comments

  • UltraNest version: 3.4.6
  • Python version: 3.8
  • Operating System: Ubuntu 20.04

Description

I'm using UltraNest in parallel mode to fit a (non-Gaussian) mixture model to a set of 1-D data points. The weights of the mixture components are constrained to sum to unity, so I'm using a Dirichlet prior. Things work fine when I run this on a single processor, but when I run it on, say, 10 processors, the program falls over.
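For context, one common way to implement such a unit-sum prior transform is to turn the unit-cube samples into standard exponentials and normalise them, which yields a flat Dirichlet over the weights. This is only a sketch for illustration, not the transform used in this issue:

import numpy as np

def dirichlet_prior_transform(cube):
    # -log(u) maps uniform samples to standard exponentials; normalising
    # exponentials gives weights distributed as a flat Dirichlet (alpha = 1)
    x = -np.log(cube)
    return x / x.sum()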

What I Did

I currently run my code via openmpi: mpirun.openmpi -np 10 --hostfile hostfile ./DD-SD-SUP_vectorized_fitfix_dirichlet.py

hostfile contains just a single line: localhost slots=10

My machine has 12 physical cores, so this should be fine. Note that I've also tried mpich, with the same results.
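For reference, here is a minimal sketch of the kind of driver script that gets launched this way (the parameter names, prior and likelihood below are placeholders, not the model from this issue). UltraNest parallelises automatically over MPI when mpi4py is available, and the run call matches the one in the traceback below:

import numpy as np
from ultranest import ReactiveNestedSampler

param_names = ["w1", "w2", "w3"]              # hypothetical mixture weights

def prior_transform(cube):
    # placeholder unit-sum weights (flat Dirichlet), as sketched above
    x = -np.log(cube)
    return x / x.sum()

def loglike(params):
    # placeholder likelihood, just so the sketch is runnable
    return -0.5 * float(np.sum((params - 0.3) ** 2))

sampler = ReactiveNestedSampler(param_names, loglike, prior_transform)
result = sampler.run(min_num_live_points=400)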

The result of this is 10x the following:

Traceback (most recent call last):
  File "./DD-SD-SUP_vectorized_fitfix_dirichlet.py", line 1024, in <module>
    result = sampler.run(min_num_live_points=400)
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/ultranest/integrator.py", line 2226, in run
    for result in self.run_iter(
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/ultranest/integrator.py", line 2438, in run_iter
    region_fresh = self._update_region(
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/ultranest/integrator.py", line 1998, in _update_region
    f = np.max(recv_enlarge)
  File "<__array_function__ internals>", line 180, in amax
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2791, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/home/christian/Desktop/anaconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I've been able to trace the issue down quite a bit further, though. The underlying problem seems to be that the np.max call fails because recv_enlarge isn't the simple 1-D array of 10 numbers it should be. Instead, the entries corresponding to certain processes contain -- bizarrely -- the prior values that were calculated by prior_transform. The shape of recv_enlarge is therefore wrong, and hence the error.
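Incidentally, the exact ValueError can be reproduced outside MPI by handing np.max a gathered list in which one slot holds an array instead of a scalar (the numbers below are made up):

import numpy as np

# what recv_enlarge should look like: one enlargement factor per MPI rank
good = [1.2, 1.3, 1.1, 1.25]
print(np.max(good))    # fine

# what it apparently looks like here: one rank's slot holds a prior vector instead
bad = np.array([1.2, 1.3, np.array([0.1, 0.5, 0.4]), 1.25], dtype=object)
print(np.max(bad))     # raises: ValueError: The truth value of an array ... is ambiguous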

Tracking this down further shows that the "gather" command doesn't return correctly for some processes (and, for some reason, doesn't correctly block progress). This in turn can be traced to the tregion.compute_enlargement call not returning at all for some processes.

What's even more crazy is this: if I make prior_transform return the array that was passed into it -- i.e. untransformed random numbers -- the above problems do not occur. But I have checked that the transformed arrays that are actually returned have the same shape, and reasonable entries, in all cases.

lazygun37 avatar Aug 28 '22 18:08 lazygun37

Thanks for reporting this and tracking it down.

On the last part: Did you return the passed array object or a copy of it? It may make a difference (reference to other object vs a new object).
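A quick standalone way to answer that, independent of UltraNest (just a sketch):

import numpy as np

def check_transform_output(prior_transform, ndim=5):
    # report whether prior_transform returns the object it was given or a new array
    u = np.random.rand(ndim)
    v = prior_transform(u)
    print("same object:   ", v is u)
    print("shares memory: ", np.shares_memory(u, v))
    print("shapes:        ", np.shape(u), np.shape(v))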

Can you print what recv_enlarge contains exactly?

This is very odd indeed. Is it possible that the MPI commands somehow ran out of sync? I wonder if it would be possible to make a concise test code that triggers the bug, and report it upstream.

Maybe it could help to print out the shape of the MPI arrays passed before every MPI call, to find out which code segment injects the prior values that are then received elsewhere?
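As an illustration of that hypothesis, here is a tiny standalone mpi4py sketch (not UltraNest code): collectives on a communicator match up purely by call order, so if one rank issues its gathers in a different order than the others, the root receives that rank's prior-sized array in the slot where it expected an enlargement scalar -- exactly the shape mismatch above. Run with e.g. mpirun -np 4 python gather_demo.py:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 1:
    # this rank issues its two gathers in the wrong order ...
    priors = comm.gather(np.random.rand(5), root=0)
    enlarge = comm.gather(1.0 + 0.1 * rank, root=0)
else:
    # ... while every other rank issues them in the intended order
    enlarge = comm.gather(1.0 + 0.1 * rank, root=0)
    priors = comm.gather(np.random.rand(5), root=0)

if rank == 0:
    # the 'enlarge' slot for rank 1 now contains its 5-element prior vector
    print("recv_enlarge:", enlarge)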

~~Which MPI implementation are you using? To circumvent the bug, maybe try switching to another implementation.~~ Just saw that you already use openmpi and have tried mpich.

JohannesBuchner avatar Aug 29 '22 06:08 JohannesBuchner

Good suggestion about copy vs actual passed array. It turns out the issue is maybe even more bizarre than I thought. After messing around with things for a bit, I ended up with this sort of structure in my prior_transform:

def prior_transform(uni_rands):

    # ... body of the function transforms the uniform random numbers into
    #     Dirichlet-distributed numbers and stuffs them into an array called par

    assert np.shape(par) == np.shape(uni_rands)

    w = 1.e-7
    par2 = (1.0 - w)*par + w*uni_rands

    return par2

It appears that this always works fine for "sufficiently large" w -- but "sufficiently large" can be really small. In fact, it has even worked for w = 0!

Yet if I simply set par2 = par, it never works, and, as near as I can tell, even setting just par2 = 1.0*par never works -- even though presumably that should be the same as w = 0.

I have no idea what's going on there, but I guess I at least have some sort of workaround now -- i.e. for sufficiently small w, I guess I don't really care...
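For what it's worth, here is a standalone check (with made-up arrays standing in for par and uni_rands) of what actually distinguishes the three variants: only par2 = par returns the very same object, and only the w > 0 version changes the values, so at face value neither object identity nor the values alone lines up with which variants fail:

import numpy as np

uni_rands = np.random.rand(5)
par = -np.log(uni_rands)
par /= par.sum()                # stand-in for the Dirichlet-transformed values

w = 1.e-7
variants = {
    "par2 = par":                     par,
    "par2 = 1.0*par":                 1.0 * par,
    "par2 = (1-w)*par + w*uni_rands": (1.0 - w) * par + w * uni_rands,
}
for name, par2 in variants.items():
    print("%-33s  new object: %-5s  values identical: %s"
          % (name, par2 is not par, np.array_equal(par2, par)))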

I still hope to take a look at your other suggestions as well. Any other ideas would be hugely welcome, of course.

lazygun37 avatar Aug 29 '22 10:08 lazygun37

By the way: did you have any thoughts on what could cause tregion.compute_enlargement to fail to return for some processes? I'm pretty sure that is the underlying issue. E.g. if I stick an explicit comm.Barrier in before the gather/bcast calls, I can prevent the errors from happening, at the expense of the code hanging -- and the hang is definitely because some processes never get out of the tregion.compute_enlargement call.
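In case it helps to localise that, a small per-rank tracing helper (a sketch, independent of UltraNest) could be wrapped around the suspect call site -- e.g. traced("compute_enlargement", tregion.compute_enlargement, ...) with whatever arguments that call takes -- so that a rank stuck inside it shows up as an "entering" line with no matching "left" line:

from mpi4py import MPI

def traced(label, fn, *args, **kwargs):
    # print per-rank enter/exit markers around a call, flushing immediately
    rank = MPI.COMM_WORLD.Get_rank()
    print("rank %d: entering %s" % (rank, label), flush=True)
    result = fn(*args, **kwargs)
    print("rank %d: left %s" % (rank, label), flush=True)
    return result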

lazygun37 avatar Aug 29 '22 10:08 lazygun37

Can you find out what this line: self.build_tregion = not is_affine_transform(active_u, active_v) is doing in each process?

The behaviour could be explained if it is set differently for some processes; then the if active_p is None or not self.build_tregion: branch would be entered inconsistently.
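For intuition, is_affine_transform decides whether the mapping from the unit-cube points active_u to the transformed points active_v is (close to) affine; below is a conceptual sketch of such a test (this is not UltraNest's actual implementation). A tolerance-based numerical test like this could plausibly come out differently on different processes if their inputs differ even slightly, which would match the explanation above. It would also fit the earlier observation: a prior_transform that returns its input unchanged is exactly affine, so the check would give the same answer on every rank and the branch above would be taken consistently.

import numpy as np

def looks_affine(u, v, tol=1e-10):
    # crude test of whether every column of v is an affine function of u:
    # fit v ~ [u, 1] @ coef by least squares and check the worst residual
    n = len(u)
    X = np.hstack([u, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return float(np.abs(X @ coef - v).max()) < tol

u = np.random.rand(100, 3)
print(looks_affine(u, u))                                  # identity transform: True
print(looks_affine(u, u / u.sum(axis=1, keepdims=True)))   # normalisation is nonlinear: False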

JohannesBuchner avatar Aug 29 '22 12:08 JohannesBuchner