Emcee and MPI with mpi4py
General information:
- emcee version: 3.0rc2
- platform: Ubuntu
- installation method (pip/conda/source/other?): pip
Problem description:
Hi all,
I am pretty new to MPI parallelization, and this is my first attempt at using it in a Python code.
I have tried parallelizing emcee with mpi4py in a different way from the minimal examples proposed in the code's documentation (which actually use multiprocessing).
The idea behind this workaround is that I want each process to write its own "backend" file while emcee is running, which, to my knowledge so far, cannot be done with the sampler's "pool" option. Furthermore, compared to the "pool" method, I can draw many more models at the same time, at a speed comparable to running the code without parallelization.
What I would like to know is whether this is the right way to achieve what I want, or whether there is another way to do it using the "pool" method. The workaround works fine, but I am concerned about the results I am drawing from it: are the results of each MCMC process really independent from one another this way, or is there something I am missing?
Best, F.
Minimal example:
```python
import numpy as np
import emcee
from mpi4py import MPI

def lnlike(theta, x, y, yerr):
    m, b, lnf = theta
    model = m * x + b
    inv_sigma2 = 1.0 / (yerr**2 + model**2 * np.exp(2 * lnf))
    return -0.5 * np.sum((y - model)**2 * inv_sigma2 - np.log(inv_sigma2))

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# placeholder data and initial guess (the real x, y, yerr and p0
# come from the actual application)
x = np.sort(10 * np.random.rand(50))
yerr = 0.1 + 0.5 * np.random.rand(len(x))
y = 2.0 * x - 1.0 + yerr * np.random.randn(len(x))
p0 = np.array([2.0, -1.0, -0.7])

sampler = {}
backends = {}
filename = "examples_" + str(rank + 1) + ".h5"
ndim, nwalkers = 3, 10

# use the backends option of emcee to store each rank's chain
backends[str(rank)] = emcee.backends.HDFBackend(filename)
backends[str(rank)].reset(nwalkers, ndim)

# create an initialization for the walkers
pos = [p0 + 1e-4 * np.random.randn(ndim) for i in range(nwalkers)]

sampler[str(rank)] = emcee.EnsembleSampler(
    nwalkers, ndim, lnlike, args=(x, y, yerr), backend=backends[str(rank)]
)

# start the actual MCMC process
sampler[str(rank)].run_mcmc(pos, 10000, progress=True)
```
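For reference, a script like this is launched with something like `mpiexec -n 4 python script.py`, so that each rank runs its own independent sampler and writes its own HDF5 file.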
There isn't anything fundamentally wrong with this approach (it can actually be useful, since you get many independent ensembles in this case). But one thing you need to be careful about is making sure that the random number generator on each process is properly initialized, or you'll end up with chains that are correlated or even identical.
I ran into a similar situation. Is it possible to seed the random numbers used in emcee with random.seed()? Or does emcee already adopt a particular way of seeding the random number generator (in which case I wouldn't have control over it)?
The EnsembleSampler creates a RandomState (that it owns) when it's initialized, so the easiest way to have control over the state is to call np.random.seed before initializing the sampler. It would be good to add a seed argument to the sampler, but that isn't implemented yet.
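For the MPI workaround above, a minimal sketch of that advice would be to derive a distinct seed from the rank before constructing the sampler (the base_seed value here is an arbitrary placeholder, not anything emcee prescribes):

```python
import numpy as np
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# Seed the global NumPy state differently on every process, so that
# both the walker initialization and the RandomState the sampler
# creates at init time are reproducible and distinct per rank.
base_seed = 42  # arbitrary user-chosen constant, not an emcee default
np.random.seed(base_seed + rank)

# ... now build the EnsembleSampler exactly as in the example above.
```

Seeding with base_seed + rank keeps runs reproducible while guaranteeing that no two processes start from the same state, which avoids the correlated or identical chains mentioned above.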