signac-flow

Provide helpful error message when double-initializing MPI.

Open joaander opened this issue 1 year ago • 1 comments

Feature description

Check whether MPI is already initialized when run is about to fork and launch an MPI process.

Proposed solution

import hoomd

import ctypes
import platform

system = platform.system()
extension = ''
if system == 'Darwin':
    extension = 'dylib'
elif system == 'Linux':
    extension = 'so'
elif system == 'Windows':
    extension = 'dll'

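try:
    # Load the MPI shared library; OSError is raised if it cannot be found.
    libmpi = ctypes.CDLL('libmpi.' + extension, ctypes.RTLD_GLOBAL)

    # MPI_Initialized stores a nonzero value in flag if MPI_Init has been called.
    flag = ctypes.c_int()
    libmpi.MPI_Initialized(ctypes.byref(flag))

    if flag.value:
        print('MPI is initialized')  # Replace with an exception and a helpful message
except OSError:
    # No MPI library is available, so there is nothing to check.
    pass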

Additional context

By using ctypes to call MPI_Initialized, we add no new dependencies.
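As a rough sketch of what replacing the print with an exception might look like, the check could be split into a query function and a guard. The names mpi_is_initialized and ensure_mpi_not_initialized are hypothetical and not part of signac-flow, and the use of ctypes.util.find_library to locate the library is an assumption that may not find the same MPI build that hoomd or mpi4py linked against.

```python
import ctypes
import ctypes.util


def mpi_is_initialized(library_name="mpi"):
    """Return True if an MPI library can be loaded and reports MPI_Init was called.

    Hypothetical helper: returns False when no MPI library is found.
    """
    path = ctypes.util.find_library(library_name)
    if path is None:
        return False
    try:
        libmpi = ctypes.CDLL(path, ctypes.RTLD_GLOBAL)
        flag = ctypes.c_int(0)
        # MPI_Initialized may be called before MPI_Init; it sets flag to nonzero
        # if MPI has already been initialized in this process.
        libmpi.MPI_Initialized(ctypes.byref(flag))
    except (OSError, AttributeError):
        # Library failed to load or does not export MPI_Initialized.
        return False
    return bool(flag.value)


def ensure_mpi_not_initialized(is_initialized=mpi_is_initialized):
    """Raise a descriptive error if MPI is already initialized (hypothetical guard)."""
    if is_initialized():
        raise RuntimeError(
            "MPI is already initialized in this process. Launching an MPI "
            "operation from here is likely to fail. Avoid importing packages "
            "such as hoomd or mpi4py at the top level of the project script."
        )
```

Passing the query function as a parameter keeps the guard testable on machines without MPI installed.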

Packages like mpi4py and hoomd automatically initialize MPI on import. signac-flow then forks to execute the operation with srun python project.py exec ..., which imports hoomd or mpi4py again. This causes an error similar to:

[gl3081.arc-ts.umich.edu:2369448] OPAL ERROR: Unreachable in file ext3x_client.c at line 111
srun: error: gl3081: task 0: Exited with exit code 1
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------

A helpful error message would let users detect these cases.

There is no reliable way to prevent the double initialization except by asking users to not import these packages at the top level.
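To illustrate what avoiding a top-level import could look like, here is a minimal sketch of an operation body that imports the MPI-using package only when it runs. The function name run_simulation and its parameters are hypothetical; in a real project one would typically just write import hoomd as the first line of the operation function.

```python
import importlib


def run_simulation(job_id, mpi_package="hoomd"):
    """Hypothetical operation body that defers the MPI-initializing import."""
    # The import happens only when the operation executes, not when the
    # project script itself is imported by the submitting process.
    package = importlib.import_module(mpi_package)
    return f"job {job_id} ran with {package.__name__}"
```

With this layout, merely importing the project script does not initialize MPI; only the process that actually executes the operation does.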

joaander Feb 22 '24 17:02

To avoid conflicts, we should implement this in or after #819.

joaander Feb 22 '24 17:02

I have no plans to implement this check.

joaander May 28 '24 16:05