Platypus
Problem with MPI
Hi!
I am trying to distribute the fitness evaluations using the MPIPool facility. Unfortunately, my code crashes with the following error message:
Traceback (most recent call last):
File "main_mpi.py", line 125, in invoke_master
engine.run(ITERATIONS*INDIVIDUALS)
File "build/bdist.linux-i686/egg/platypus/core.py", line 304, in run
File "build/bdist.linux-i686/egg/platypus/algorithms.py", line 173, in step
File "build/bdist.linux-i686/egg/platypus/algorithms.py", line 183, in initialize
File "build/bdist.linux-i686/egg/platypus/algorithms.py", line 72, in initialize
File "build/bdist.linux-i686/egg/platypus/core.py", line 277, in evaluate_all
File "build/bdist.linux-i686/egg/platypus/evaluator.py", line 88, in evaluate_all
File "build/bdist.linux-i686/egg/platypus/mpipool.py", line 195, in map
File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv
File "mpi4py/MPI/msgpickle.pxi", line 269, in mpi4py.MPI.PyMPI_recv_match
File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
File "mpi4py/MPI/msgpickle.pxi", line 100, in mpi4py.MPI.Pickle.cloads
TypeError: ('__init__() takes exactly 2 arguments (1 given)', <class 'platypus.mpipool.MPIPoolException'>, ())
I am surely missing something. Do you have any suggestions or insights about the cause? I'll try to provide a minimal example of my code, if necessary.
Thank you!
Hi,
It looks like an error is being thrown in your evaluation function on one of the workers, but another error is happening while passing the exception details via MPI.
I would suggest trying two things. The first, and easiest, is to add a try/except block to your evaluation function so you can catch the error, print the details, and figure out what is throwing the original exception.
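As a sketch of what that wrapper could look like (the function name, signature, and the placeholder fitness computation are all illustrative, not Platypus API):

```python
import traceback

def evaluate(vars):
    # Illustrative evaluation function; substitute your own objective code.
    try:
        objectives = [sum(x * x for x in vars)]  # placeholder fitness
        return objectives
    except Exception:
        # Print the full traceback on the worker itself, so the original
        # error is visible even if exception passing over MPI fails.
        traceback.print_exc()
        raise
```

Because the traceback is printed on the worker before the exception is re-raised, the root cause shows up in that worker's output even when the pickled exception gets mangled on the way back to the master.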
The second, more involved, is to edit platypus/mpipool.py around lines 290-292, changing MPIPoolException from:
class MPIPoolException(Exception):
    def __init__(self, tb):
        self.traceback = tb
to
class MPIPoolException(Exception):
    def __init__(self, *args):
        super(MPIPoolException, self).__init__(*args)
        self.traceback = "See exception details for traceback"
This will let MPIPoolException accept any number of arguments, which will fix the TypeError reported. I'll need to investigate a more permanent fix.
There was actually an indexing problem in one of the workers, masked by the exception message passing. The try/except strategy you suggested worked perfectly, and everything seems to be working fine now. Thank you very much for your help!
Hi, I have a new issue, apparently related to MPI. When I use NSGA-II, after two rounds of fitness evaluations I get this error:
Traceback (most recent call last):
File "main_mpi.py", line 251, in <module>
result = invoke_master(problem, PE, pool)
File "main_mpi.py", line 140, in invoke_master
engine.run(PE.get_iterations()*PE.get_individuals())
File "build/bdist.linux-x86_64/egg/platypus/core.py", line 304, in run
File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 175, in step
File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 198, in iterate
File "build/bdist.linux-x86_64/egg/platypus/core.py", line 277, in evaluate_all
File "build/bdist.linux-x86_64/egg/platypus/evaluator.py", line 88, in evaluate_all
File "build/bdist.linux-x86_64/egg/platypus/mpipool.py", line 195, in map
File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv
File "mpi4py/MPI/msgpickle.pxi", line 269, in mpi4py.MPI.PyMPI_recv_match
File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
File "mpi4py/MPI/msgpickle.pxi", line 100, in mpi4py.MPI.Pickle.cloads
cPickle.UnpicklingError: invalid load key, '{'.
Is there an explanation for this behavior? Thank you for your help!
Is it related to this? https://github.com/dfm/emcee/issues/200
There is not a lot of information there, but they suggest running with python-mpi instead of python; it may also be something to do with pool initialisation?
An answer on SO suggests this is caused by loading something that was not pickled, but the failure is buried deep in mpi4py. I suggest double-checking that your evaluate function is not returning anything "weird". Can you share a minimal example that fails?
https://stackoverflow.com/questions/8111078/unpicklingerror-invalid-load-key
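One way to sanity-check for "weird" return values, without assuming anything about Platypus internals: mpi4py pickles Python objects before sending them between ranks, so you can verify that whatever your evaluation function returns survives a pickle round-trip. A minimal helper (the function name is mine, just for illustration):

```python
import pickle

def is_picklable(result):
    """Return True if `result` survives a pickle round-trip.

    mpi4py serializes Python objects with pickle when shipping them
    between ranks, so an unpicklable value returned by an evaluation
    function (a lambda, an open file handle, an MPI object, ...) can
    surface later as an UnpicklingError or EOFError on the receiver.
    """
    try:
        return pickle.loads(pickle.dumps(result)) == result
    except Exception:
        return False
```

A plain list of floats passes; a lambda or an open file handle does not.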
Hi. Interestingly, the problem disappeared when I increased the number of individuals (I now use 200 candidate solutions for NSGA-II). Maybe it was something related to the Pareto front calculation?
EDIT: no, it is now crashing again, just after multiple iterations. It seems non-deterministic.
INFO:Platypus:Closed pool evaluator
Traceback (most recent call last):
File "main_mpi.py", line 253, in <module>
result = invoke_master(problem, PE, pool)
File "main_mpi.py", line 143, in invoke_master
engine.run(PE.get_iterations()*PE.get_individuals())
File "build/bdist.linux-x86_64/egg/platypus/core.py", line 304, in run
File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 175, in step
File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 198, in iterate
File "build/bdist.linux-x86_64/egg/platypus/core.py", line 277, in evaluate_all
File "build/bdist.linux-x86_64/egg/platypus/evaluator.py", line 88, in evaluate_all
File "build/bdist.linux-x86_64/egg/platypus/mpipool.py", line 195, in map
File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv
File "mpi4py/MPI/msgpickle.pxi", line 269, in mpi4py.MPI.PyMPI_recv_match
File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
File "mpi4py/MPI/msgpickle.pxi", line 100, in mpi4py.MPI.Pickle.cloads
EOFError
Update: after some googling, this seems to be a memory-related problem. I am performing new tests on a larger cluster, with each node equipped with 128 GB of RAM. I will let you know if that finally solves the new problem.
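If memory or message size is the suspect, one simple diagnostic (plain pickle, not Platypus or mpi4py API) is to measure how large the serialized payload actually is before MPI ships it; the helper name here is hypothetical:

```python
import pickle

def payload_size(obj):
    """Return the size in bytes of `obj` after pickling.

    Large evaluation results (e.g. big arrays attached to each
    solution) inflate the MPI messages; a truncated or dropped
    message can then show up as EOFError on the receiving rank.
    """
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
```

Logging this value for each solution sent through the pool makes it easy to spot an unexpectedly large result that might strain the MPI transport or the node's memory.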
Has there been an update here? I get the same error messages (both of them) when running on hpc as well as on my local machine.
This issue is stale and will be closed soon. If you feel this issue is still relevant, please comment to keep it active. Please also consider working on a fix and submitting a PR.