
Problem with MPI


Hi!

I am trying to distribute the fitness evaluations using the MPIPool facility. Unfortunately, my code crashes with the following error message:

Traceback (most recent call last):
  File "main_mpi.py", line 125, in invoke_master
    engine.run(ITERATIONS*INDIVIDUALS) 
  File "build/bdist.linux-i686/egg/platypus/core.py", line 304, in run
  File "build/bdist.linux-i686/egg/platypus/algorithms.py", line 173, in step
  File "build/bdist.linux-i686/egg/platypus/algorithms.py", line 183, in initialize
  File "build/bdist.linux-i686/egg/platypus/algorithms.py", line 72, in initialize
  File "build/bdist.linux-i686/egg/platypus/core.py", line 277, in evaluate_all
  File "build/bdist.linux-i686/egg/platypus/evaluator.py", line 88, in evaluate_all
  File "build/bdist.linux-i686/egg/platypus/mpipool.py", line 195, in map
  File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 269, in mpi4py.MPI.PyMPI_recv_match
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 100, in mpi4py.MPI.Pickle.cloads
TypeError: ('__init__() takes exactly 2 arguments (1 given)', <class 'platypus.mpipool.MPIPoolException'>, ())

I am surely missing something. Do you have any suggestions or insights about the cause? I'll try to provide a minimal example of my code, if necessary.

Thank you!

aresio avatar Feb 13 '18 09:02 aresio

Hi,

It looks like an error is being thrown in your evaluation function on one of the workers, but a second error occurs while the exception details are being passed back over MPI.

I would suggest trying two things. The first, and easiest, is to add a try-except block to your evaluation function so you can catch the error, print the details, and figure out what is throwing the original exception.
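For example (a minimal sketch; evaluate and its placeholder objective are stand-ins for your actual fitness function):

```python
import traceback

def evaluate(vars):
    # Stand-in for your real fitness function.
    try:
        fitness = sum(v ** 2 for v in vars)  # placeholder objective
        return [fitness]
    except Exception:
        # Print the full traceback on the worker itself, before the
        # exception gets forwarded (and mangled) through MPI.
        traceback.print_exc()
        raise
```

This way the real traceback appears in the worker's output instead of being lost in transit.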

The second, more involved, is to edit platypus/mpipool.py at lines 290-292, changing MPIPoolException from:


class MPIPoolException(Exception):
    def __init__(self, tb):
        self.traceback = tb

to:


class MPIPoolException(Exception):
    def __init__(self, *args):
        super(MPIPoolException, self).__init__(*args)
        self.traceback = "See exception details for traceback"

This will let MPIPoolException accept any number of arguments, which will fix the TypeError reported. I'll need to investigate a more permanent fix.
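As a standalone sanity check (a sketch only, not the actual Platypus module), the patched class constructs cleanly with any number of arguments, which is what unpickling on the master side requires:

```python
class MPIPoolException(Exception):
    # Patched version: accepts any number of positional arguments.
    def __init__(self, *args):
        super(MPIPoolException, self).__init__(*args)
        self.traceback = "See exception details for traceback"

# Zero, one, or several arguments all work now; the original
# one-argument __init__ raised TypeError when the unpickler
# called it without the traceback argument.
for exc in (MPIPoolException(),
            MPIPoolException("worker failed"),
            MPIPoolException("worker failed", 42)):
    assert isinstance(exc, Exception)
    assert exc.traceback == "See exception details for traceback"
```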

dhadka avatar Feb 13 '18 14:02 dhadka

There was actually an indexing problem in one worker, masked by the exception message passing. The try/except strategy you suggested worked perfectly, and everything seems to be working fine now. Thank you very much for your help!

aresio avatar Feb 14 '18 03:02 aresio

Hi, I have a new issue, apparently also related to MPI. When I use NSGA-II, after two rounds of fitness evaluations I get this error:

Traceback (most recent call last):
  File "main_mpi.py", line 251, in <module>
    result = invoke_master(problem, PE, pool)
  File "main_mpi.py", line 140, in invoke_master
    engine.run(PE.get_iterations()*PE.get_individuals())
  File "build/bdist.linux-x86_64/egg/platypus/core.py", line 304, in run
  File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 175, in step
  File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 198, in iterate
  File "build/bdist.linux-x86_64/egg/platypus/core.py", line 277, in evaluate_all
  File "build/bdist.linux-x86_64/egg/platypus/evaluator.py", line 88, in evaluate_all
  File "build/bdist.linux-x86_64/egg/platypus/mpipool.py", line 195, in map
  File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 269, in mpi4py.MPI.PyMPI_recv_match
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 100, in mpi4py.MPI.Pickle.cloads
cPickle.UnpicklingError: invalid load key, '{'.

Is there an explanation for this behavior? Thank you for your help!

aresio avatar Feb 20 '18 05:02 aresio

Is it related to this: https://github.com/dfm/emcee/issues/200? There's not a lot of information there, but they suggest running with python-mpi instead of python, and they also mention something about pool initialisation.

An answer on SO suggests this is caused by loading something that's not pickled, but it looks buried deep in mpi4py. I suggest double-checking that your evaluate function is not returning anything "weird". Can you share a minimal example that fails?

https://stackoverflow.com/questions/8111078/unpicklingerror-invalid-load-key
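One way to double-check (a sketch; check_picklable is a hypothetical helper, not part of Platypus or mpi4py) is to round-trip the evaluator's return value through pickle yourself, which surfaces transport problems with a much clearer error than the one deep inside mpi4py:

```python
import pickle

def check_picklable(obj):
    # Round-trip through pickle -- the same serialization mpi4py
    # applies to generic Python objects sent over send/recv.
    return pickle.loads(pickle.dumps(obj))

# Plain numbers, lists, and dicts survive the round trip:
assert check_picklable([1.0, 2.5]) == [1.0, 2.5]

# Lambdas, open file handles, etc. fail here with a readable
# error instead of a cryptic UnpicklingError inside mpi4py:
try:
    check_picklable(lambda x: x)
except Exception as exc:
    print("not picklable:", exc)
```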

jetuk avatar Feb 20 '18 12:02 jetuk

Hi. Interestingly, the problem disappeared when I increased the number of candidate solutions for NSGA-II (I now use 200 individuals). Maybe it was something related to the Pareto front calculation?

EDIT: no, it is now crashing again, just after multiple iterations. It seems non-deterministic.

INFO:Platypus:Closed pool evaluator
Traceback (most recent call last):
  File "main_mpi.py", line 253, in <module>
    result = invoke_master(problem, PE, pool)
  File "main_mpi.py", line 143, in invoke_master
    engine.run(PE.get_iterations()*PE.get_individuals())
  File "build/bdist.linux-x86_64/egg/platypus/core.py", line 304, in run
  File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 175, in step
  File "build/bdist.linux-x86_64/egg/platypus/algorithms.py", line 198, in iterate
  File "build/bdist.linux-x86_64/egg/platypus/core.py", line 277, in evaluate_all
  File "build/bdist.linux-x86_64/egg/platypus/evaluator.py", line 88, in evaluate_all
  File "build/bdist.linux-x86_64/egg/platypus/mpipool.py", line 195, in map
  File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 269, in mpi4py.MPI.PyMPI_recv_match
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 100, in mpi4py.MPI.Pickle.cloads
EOFError

aresio avatar Feb 21 '18 04:02 aresio

Update: after some googling, this seems to be a memory-related problem. I am performing new tests on a larger cluster whose nodes each have 128 GB of RAM. I will let you know if that finally solves the new problem.
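If memory is indeed the suspect, one cheap diagnostic (a sketch; the payload below is just a stand-in for whatever a batch of solutions serializes to) is to measure the pickled size of the data before MPI has to carry it:

```python
import pickle

# Stand-in for a batch of candidate solutions returned by workers.
payload = [0.0] * (10 ** 6)

size = len(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
print("pickled payload: %.1f MB" % (size / 1e6))
```

A truncated pickle stream, which is one way to end up with EOFError or "invalid load key" on the receiving side, becomes much more plausible once these payloads grow large.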

aresio avatar Feb 21 '18 07:02 aresio

Has there been an update here? I get the same error messages (both of them) when running on an HPC cluster as well as on my local machine.

human144 avatar Nov 23 '18 09:11 human144

This issue is stale and will be closed soon. If you feel this issue is still relevant, please comment to keep it active. Please also consider working on a fix and submitting a PR.

github-actions[bot] avatar Nov 13 '22 03:11 github-actions[bot]