
MPI exception (MPI_ERR_TAG: invalid tag) when running emcee under MPI on an external HPC

Open podesse opened this issue 3 years ago • 7 comments

Hello,

I have been attempting to run emcee on an external high-performance cluster using MPI, but the run crashes whenever emcee samples a set of parameters that is non-physical. An invalid tag appears to be sent between MPI processes after the log-probability function tries to return -np.inf to flag the failed parameter set. From what I understand, the sampler should handle non-physical parameter sets without crashing the entire run: if it draws a non-physical set, it should simply propose a new one. Under MPI, however, this instead brings the whole job to a halt. I don't have enough experience with MPI to tell whether this is a bug in PHOEBE, a problem with my installation, or an issue for the compute cluster support to handle, so I thought I would reach out here in case the solution is obvious from my error message (included below).

Activated virtual environment

emcee: Exception while calling your likelihood function:
  params: [6.96748539e+00 7.77665521e+01 4.74394715e-01 2.25946374e+00
 1.80160672e+00 8.42655274e+03 6.96339481e+03]
  args: []
  kwargs: {'b': <PHOEBE Bundle: 192 parameters | contexts: dataset, distribution, system, compute, setting, component, constraint, solver>, 'params_uniqueids': ['WBytuDSNduLrZoRKJhTsRZNGfMLHKZ', 'nQwFydfdNInCTtPdCqOorIzoaLKhDi', 'dwNTmhvLHrxmssyeNzMxGXZzEOqvdy', 'mjcrpsHmjcLUWTfdpiGajwLtoDWCWs', 'WYgosfgKWoRSfItmmHqbqBJcNQukgR', 'JtAMPWkTHmoVsNilobYmpUFoVXCBeM', 'VtcCyxDFKUzdZCGirPCXYNhOVhKtss'], 'compute': 'phoebe01', 'priors': ['prior_distribution'], 'priors_combine': 'and', 'solution': 'latest', 'compute_kwargs': {'comments': '', 'expose_failed': True}, 'custom_lnprobability_callable': None, 'failed_samples_buffer': []}
  exception:
Traceback (most recent call last):
  File "/home/peodesse/senoue/lib/python3.7/site-packages/emcee/ensemble.py", line 545, in __call__
    return self.f(x, *self.args, **self.kwargs)
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/solverbackends/solverbackends.py", line 166, in _lnprobability
    return _return(-np.inf, 'lnpriors = -inf')
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/solverbackends/solverbackends.py", line 127, in _return
    comm.ssend(msg_tuple, 0, tag=99999999)
  File "mpi4py/MPI/Comm.pyx", line 1166, in mpi4py.MPI.Comm.ssend
  File "mpi4py/MPI/msgpickle.pxi", line 206, in mpi4py.MPI.PyMPI_ssend
mpi4py.MPI.Exception: MPI_ERR_TAG: invalid tag
Traceback (most recent call last):
  File "emcee_test.py", line 2, in <module>
    import phoebe; import json
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/__init__.py", line 460, in <module>
    backend._run_worker(packet)
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/solverbackends/solverbackends.py", line 1196, in _run_worker
    return self.run_worker(**packet)
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/solverbackends/solverbackends.py", line 1475, in run_worker
    pool.wait()
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/pool/mpi.py", line 100, in wait
    result = func(arg)
  File "/home/peodesse/senoue/lib/python3.7/site-packages/emcee/ensemble.py", line 545, in __call__
    return self.f(x, *self.args, **self.kwargs)
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/solverbackends/solverbackends.py", line 166, in _lnprobability
    return _return(-np.inf, 'lnpriors = -inf')
  File "/home/peodesse/senoue/lib/python3.7/site-packages/phoebe/solverbackends/solverbackends.py", line 127, in _return
    comm.ssend(msg_tuple, 0, tag=99999999)
  File "mpi4py/MPI/Comm.pyx", line 1166, in mpi4py.MPI.Comm.ssend
  File "mpi4py/MPI/msgpickle.pxi", line 206, in mpi4py.MPI.PyMPI_ssend
mpi4py.MPI.Exception: MPI_ERR_TAG: invalid tag

I am running PHOEBE version 2.3.41. To execute the emcee run, I'm calling the script produced when I run b.export_solver() on my bundle.

Note that this is not the full error output file; the same error is repeated a few times (two or three, depending on my specific settings) before the job ceases processing entirely. The error message does reference the specific parameter set chosen when the error was thrown; in order, the parameters are ['sma@binary', 'incl@binary', 'q', 'requiv@primary', 'requiv@secondary', 'teff@primary', 'teff@secondary']. Looking at them closely, it seems one of the parameters fell outside my priors, which is what caused the log-probability function to return -infinity. However, other sampling errors that also cause the log-probability function to return -infinity (such as the Roche-lobe overflow error) have led to the same MPI invalid-tag error.
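For reference, my understanding of how this is supposed to work comes from the standard (non-MPI) emcee pattern; this is just a generic sketch of my expectation, not PHOEBE's actual code, and within_priors and lnlikelihood are hypothetical stand-ins:

import numpy as np

def lnprob(params):
    # standard emcee pattern: a non-physical sample just returns -inf
    # and the walker proposes a new position; nothing crashes
    if not within_priors(params):   # hypothetical prior check
        return -np.inf
    return lnlikelihood(params)     # hypothetical likelihood

Under MPI, though, PHOEBE's _return() additionally ssends the failure report back to rank 0 (visible in the traceback above), and that send is where the invalid tag is raised.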

If there is any more information you need, please let me know – I'll be happy to share!

Cheers, Padraic

podesse avatar May 21 '21 12:05 podesse

I haven't seen this before, so my guess would be that your MPI version/setup is more restrictive than others about which tags are valid. If you can somehow track that down (whether MPI wants specific tags or the one we use just falls outside some allowed range), either by reproducing the error outside phoebe or by digging into those mpi4py lines at the end of the traceback, I'd be happy to modify the tags used by phoebe so that they're valid.
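Something like this minimal mpi4py script (an untested sketch) might confirm whether your stack caps tags below the 99999999 that phoebe uses; some interconnect layers advertise a much lower MPI_TAG_UB than you'd see on a laptop:

# check_tags.py; run with: mpiexec -n 2 python check_tags.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# the MPI standard only guarantees tags up to 32767; the actual ceiling
# is advertised by the MPI_TAG_UB attribute on COMM_WORLD
tag_ub = comm.Get_attr(MPI.TAG_UB)
if rank == 0:
    print("MPI_TAG_UB =", tag_ub)  # if this is < 99999999, that's the culprit

# then try the exact tag phoebe uses; on a restrictive stack these
# calls should abort with MPI_ERR_TAG, reproducing your error
if rank == 1:
    comm.ssend("ping", dest=0, tag=99999999)
elif rank == 0:
    print(comm.recv(source=1, tag=99999999))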

As a side note, I am currently working on smarter initialization options to avoid starting a walker at -inf, but I suspect that won't fully fix this for you on its own (and won't be officially out until the 2.4 release). Until then, I would suggest setting the initializing distributions such that most, if not all, of the walkers start with finite probabilities... otherwise those walkers may get stuck and never be productive.
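For example, something along these lines (just a sketch; adjust the twigs and widths for your system, and 'emcee_solver' stands in for whatever you named your solver):

# narrow distributions centered on the current (physical) face values,
# so walkers start at finite log-probability
b.add_distribution({'teff@primary': phoebe.gaussian_around(100),
                    'teff@secondary': phoebe.gaussian_around(100),
                    'incl@binary': phoebe.gaussian_around(2)},
                   distribution='init_dist')

# tell the emcee solver to initialize walkers from these distributions
b.set_value('init_from', solver='emcee_solver', value=['init_dist'])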

kecnry avatar May 21 '21 12:05 kecnry

OK, I'll dig into this some more. I'm running openmpi version 3.1.2, which is not the most recent version, and mpi4py 3.0.3. Does PHOEBE 2.3's MPI functionality work with this version of openmpi? If not, that could be the source of my problem (and I could contact the compute cluster support to try to get an environment with an updated openmpi). Otherwise, I'm still in the process of tracking down which tags are valid.

podesse avatar May 21 '21 13:05 podesse

Can you try changing tag=99999999 in solverbackends.py:127 to something else, e.g. tag=999? When distributing work, we tag individual tasks, so the tag should be a number greater than the number of cores you are using; but just for testing purposes, perhaps try running on up to 998 cores and set the solver tag to 999.
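That is, line 127 in your site-packages copy would become:

# phoebe/solverbackends/solverbackends.py, line 127 (inside _return)
comm.ssend(msg_tuple, 0, tag=999)  # was tag=99999999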

aprsa avatar May 21 '21 13:05 aprsa

Huh. Changing to tag=999 appears to have worked. I'm running on 16 cores currently, and expect to run on 100 cores at most (at one walker per core). We'll see how efficient that is, but as long as I'm using fewer than 999 cores, my problem might be fixed. Hopefully I haven't spoken too soon!

podesse avatar May 21 '21 14:05 podesse

Fingers crossed! What OS is your HPC running?

aprsa avatar May 21 '21 14:05 aprsa

It appears to be running CentOS Linux (version 7). It's certainly some form of Unix, at the very least. Admittedly, I'd never checked the exact OS before, so I ran cat /etc/os-release from my login node. Would that give you the appropriate information?

podesse avatar May 21 '21 14:05 podesse

I'd think so. I don't have experience with CentOS, but it's good to know for future reference, in case tag overflow bites again. We'll discuss internally how best to handle this, but I suspect that offsetting tags for tasks and reserving a small set of tags for these specialized cases would solve the issue for all practical purposes.
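For instance, something along these lines (just a sketch of the idea, not the actual implementation):

from mpi4py import MPI

comm = MPI.COMM_WORLD

# query the implementation's ceiling instead of hardcoding a large
# sentinel; every conforming MPI guarantees MPI_TAG_UB >= 32767
tag_ub = comm.Get_attr(MPI.TAG_UB)

# reserve the top few tags for control messages (e.g. the failed-sample
# buffer) and keep everything below for per-task tags
N_RESERVED = 8
FAILED_SAMPLE_TAG = tag_ub               # highest valid tag
MAX_TASK_TAG = tag_ub - N_RESERVED       # task tags stay below this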

aprsa avatar May 21 '21 14:05 aprsa