cluster EngineError running test_read_write_P_2D tests on one system
A Debian user is reporting a test failure when building adios4dolfinx 0.8.1.post0 on their system: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071722 (build log: https://people.debian.org/~sanvila/build-logs/202405/adios4dolfinx_0.8.1.post0-1_amd64-20240524T100158.350Z).
The tests pass on other Debian project machines (and my own), so I suspect the problem is related to how openmpi distinguishes slots, hwthreads, cores, sockets, etc. when binding processes, which would be system-specific.
The error is happening in ipyparallel, so I'm not certain how much adios4dolfinx can do about it (the tests would likely need to know the available slots/cores/sockets). But perhaps there's a different way of configuring the test launch that's more robust.
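One option (just a sketch, assuming ipyparallel's MPIEngineSetLauncher still exposes its configurable mpi_args trait) would be to have the test fixture pass --oversubscribe to mpiexec itself; the failing fixture and openmpi's error output follow below.
import ipyparallel as ipp
from traitlets.config import Config

# Sketch: ask Open MPI to ignore the slot count when launching the two engines.
# Assumes MPIEngineSetLauncher.mpi_args is still a configurable trait in ipyparallel.
launch_config = Config()
launch_config.MPIEngineSetLauncher.mpi_args = ["--oversubscribe"]
cluster = ipp.Cluster(engine_launcher_class="mpi", n=2, config=launch_config)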
_ ERROR at setup of test_read_write_P_2D[create_2D_mesh0-True-1-Lagrange-True] _
@pytest.fixture(scope="module")
def cluster():
cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
> rc = cluster.start_and_connect_sync()
tests/conftest.py:15:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib/python3/dist-packages/ipyparallel/_async.py:73: in _synchronize
return _asyncio_run(async_f(*args, **kwargs))
/usr/lib/python3/dist-packages/ipyparallel/_async.py:19: in _asyncio_run
return loop.run_sync(lambda: asyncio.ensure_future(coro))
/usr/lib/python3/dist-packages/tornado/ioloop.py:539: in run_sync
return future_cell[0].result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Cluster(cluster_id='1716545058-rpjw', profile='default', controller=<running>, engine_sets=['1716545059'])>
n = 2, activate = False
async def start_and_connect(self, n=None, activate=False):
"""Single call to start a cluster and connect a client
If `activate` is given, a blocking DirectView on all engines will be created
and activated, registering `%px` magics for use in IPython
Example::
rc = await Cluster(engines="mpi").start_and_connect(n=8, activate=True)
%px print("hello, world!")
Equivalent to::
await self.start_cluster(n)
client = await self.connect_client()
await client.wait_for_engines(n, block=False)
.. versionadded:: 7.1
.. versionadded:: 8.1
activate argument.
"""
if n is None:
n = self.n
await self.start_cluster(n=n)
client = await self.connect_client()
if n is None:
# number of engines to wait for
# if not specified, derive current value from EngineSets
n = sum(engine_set.n for engine_set in self.engines.values())
if n:
> await asyncio.wrap_future(
client.wait_for_engines(n, block=False, timeout=self.engine_timeout)
)
E ipyparallel.error.EngineError: Engine set stopped: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}
/usr/lib/python3/dist-packages/ipyparallel/cluster/cluster.py:759: EngineError
------------------------------ Captured log setup ------------------------------
INFO ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:708 Starting 2 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
WARNING ipyparallel.cluster.cluster.1716545058-rpjw:launcher.py:336 Output for ipengine-1716545058-rpjw-1716545059-59766:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
/usr/bin/python3.12
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
WARNING ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:721 engine set stopped 1716545059: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}
@minrk, do you have any idea? (Being the ipyparallel wizard!)
The bug reporter also notes that lscpu shows:
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
So if I'm reading the error message right, openmpi is complaining because it's been asked to run 2 processes but thinks it only has 1 core (and it's ignoring the available hwthreads).
I think we could allow for that in the debian build scripts by setting OMPI_MCA_rmaps_base_oversubscribe=true, which might be the simplest resolution.
yeah, allowing oversubscribe should be the fix here. We have to set a bunch of env vars to get openmpi to run tests reliably on CI because it's very strict and makes a lot of assumptions by default. oversubscribe is probably the main one for real user machines.
You could probably set the oversubscribe env var in your conftest to make sure folks don't run into this one.
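Something along these lines at the top of tests/conftest.py would do it (a minimal sketch; OMPI_MCA_rmaps_base_oversubscribe is the Open MPI 4.x parameter name confirmed above, and Open MPI 5 reads differently named PRTE_-prefixed variables instead):
import os

# Sketch: allow Open MPI to oversubscribe cores before ipyparallel launches its
# MPI engines; this must run before any Cluster fixture starts.
os.environ.setdefault("OMPI_MCA_rmaps_base_oversubscribe", "true")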
Our bug reporter confirms OMPI_MCA_rmaps_base_oversubscribe=true resolves the issue in the debian tests. I've now added it to the debian scripts.
Hello. Original reporter here. Enabling oversubscription worked for a while, but now with OpenMPI 5 (using the new environment variable) there is some test in test_original_checkpoint.py which makes the build machine get stuck.
I've documented this problem in the salsa commit where I've disabled those tests:
https://salsa.debian.org/science-team/fenics/adios4dolfinx/-/commit/65a294f173a94aabe314274cccbdf0cfe15bb3bb
I'm using single-CPU virtual machines from AWS for this (mainly of types m7a.medium and r7a.medium) but I'd bet that this is easily reproducible by setting GRUB_CMDLINE_LINUX="nr_cpus=1".
Edit: I forgot to say that this is for version 0.8.1. We (Debian) already have preliminary releases of 0.9.0 in experimental, so I will test again when it's present in unstable.
Hello. Version 0.9.0 is now in Debian unstable, and we still have to disable test_original_checkpoint on single-CPU systems, because otherwise the machine hangs (as if it entered an endless loop).
Is this really supposed to happen?
As I am not a developer of IPython Parallel or openmpi, it is hard for me to do much about how they work together.
The library has certain tests that must be executed in parallel, as they check functionality specific to parallel computing. If there is a nice way of checking the number of available processors on a system in Python, I could add a pytest skip conditional.
@sanvila does oversubscription fail on a single-CPU system even with openmpi's new PRTE_ environment variables (or their command-line option equivalents)? The old OMPI_MCA_rmaps_base_oversubscribe=true can be expected to do nothing now with OpenMPI 5.
@drew-parsons Yes, it fails again, even after I put the new variables in place. Maybe this is a different issue than before and we should open a new one, but the problem is still the same (does not work ok on single-cpu systems) so for simplicity I decided to report it here as well.
@jorgensd This usually works and it's simple enough:
import os
[...]
@pytest.mark.skipif(os.cpu_count() == 1, reason="not expected to work on single-CPU machines")
I can add this tomorrow
Is it worth using xfail rather than skipif, to monitor if the cluster subsystem becomes robust enough to pass the test in the future?
If it is currently hanging, xfail wouldn’t be sufficient.
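For illustration (with a hypothetical test name), xfail by itself still executes the test body, so a hanging test would still block the session; you would need run=False, which at that point gives no more information than skipif:
import os
import pytest

# xfail still runs the test unless run=False is given, so a hang would still
# block the test session; run=False avoids executing the test at all.
@pytest.mark.xfail(os.cpu_count() == 1, reason="hangs on single-CPU machines", run=False)
def test_parallel_checkpoint():  # hypothetical placeholder for the hanging tests
    ...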
@sanvila do you mind testing https://github.com/jorgensd/adios4dolfinx/pull/140 to see if it resolves the issue for you, or if I have to intercept the number of CPUs earlier?
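For reference, the guard is roughly along these lines (a sketch of the idea only; the actual change in the PR may differ):
import os

import ipyparallel as ipp
import pytest


@pytest.fixture(scope="module")
def cluster():
    # Skip the MPI-parallel tests outright on single-CPU machines instead of
    # letting Open MPI refuse to launch (or hang on) the second engine.
    if os.cpu_count() == 1:
        pytest.skip("MPI-parallel tests are not expected to work on single-CPU machines")
    cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
    rc = cluster.start_and_connect_sync()
    yield rc
    cluster.stop_cluster_sync()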
@jorgensd Yes, it works. Thanks a lot.
tests/test_numpy_vectorization.py ...................................... [ 77%]
................................................... [ 82%]
tests/test_original_checkpoint.py ssssssssssssssssssssssssssssssssssssss [ 85%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 91%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 97%]
tests/test_snapshot_checkpoint.py ............................ [ 99%]
tests/test_version.py . [100%]
(The Debian package may still need some fine-tuning for Python 3.13, but I can see that the tests that previously hung are now skipped.)
Resolved in v0.9.1