cluster EngineError running test_read_write_P_2D tests on one system
A Debian user is reporting a test failure when building adios4dolfinx 0.8.1.post0 on their system: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071722 (build log: https://people.debian.org/~sanvila/build-logs/202405/adios4dolfinx_0.8.1.post0-1_amd64-20240524T100158.350Z).
The tests pass on other Debian project machines (and my own), so I suspect the problem is related to how openmpi distinguishes slots, hwthreads, cores, sockets, etc. when binding processes, which would be system-specific.
The error is happening in ipyparallel, so I'm not certain how much adios4dolfinx can do about it (the tests would likely need to know the available slots/cores/sockets). But perhaps there's a different way of configuring the test launch that's more robust.
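One option (just a sketch, assuming ipyparallel's MPIEngineSetLauncher still exposes its configurable mpi_args trait) would be to have the test fixture pass --oversubscribe to mpiexec itself; the failing fixture and openmpi's error output follow below.
import ipyparallel as ipp
from traitlets.config import Config

# Sketch: ask Open MPI to ignore the slot count when launching the two engines.
# Assumes MPIEngineSetLauncher.mpi_args is still a configurable trait in ipyparallel.
launch_config = Config()
launch_config.MPIEngineSetLauncher.mpi_args = ["--oversubscribe"]
cluster = ipp.Cluster(engine_launcher_class="mpi", n=2, config=launch_config)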
_ ERROR at setup of test_read_write_P_2D[create_2D_mesh0-True-1-Lagrange-True] _
@pytest.fixture(scope="module")
def cluster():
cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
> rc = cluster.start_and_connect_sync()
tests/conftest.py:15:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib/python3/dist-packages/ipyparallel/_async.py:73: in _synchronize
return _asyncio_run(async_f(*args, **kwargs))
/usr/lib/python3/dist-packages/ipyparallel/_async.py:19: in _asyncio_run
return loop.run_sync(lambda: asyncio.ensure_future(coro))
/usr/lib/python3/dist-packages/tornado/ioloop.py:539: in run_sync
return future_cell[0].result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Cluster(cluster_id='1716545058-rpjw', profile='default', controller=<running>, engine_sets=['1716545059'])>
n = 2, activate = False
async def start_and_connect(self, n=None, activate=False):
"""Single call to start a cluster and connect a client
If `activate` is given, a blocking DirectView on all engines will be created
and activated, registering `%px` magics for use in IPython
Example::
rc = await Cluster(engines="mpi").start_and_connect(n=8, activate=True)
%px print("hello, world!")
Equivalent to::
await self.start_cluster(n)
client = await self.connect_client()
await client.wait_for_engines(n, block=False)
.. versionadded:: 7.1
.. versionadded:: 8.1
activate argument.
"""
if n is None:
n = self.n
await self.start_cluster(n=n)
client = await self.connect_client()
if n is None:
# number of engines to wait for
# if not specified, derive current value from EngineSets
n = sum(engine_set.n for engine_set in self.engines.values())
if n:
> await asyncio.wrap_future(
client.wait_for_engines(n, block=False, timeout=self.engine_timeout)
)
E ipyparallel.error.EngineError: Engine set stopped: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}
/usr/lib/python3/dist-packages/ipyparallel/cluster/cluster.py:759: EngineError
------------------------------ Captured log setup ------------------------------
INFO ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:708 Starting 2 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
WARNING ipyparallel.cluster.cluster.1716545058-rpjw:launcher.py:336 Output for ipengine-1716545058-rpjw-1716545059-59766:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
/usr/bin/python3.12
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
WARNING ipyparallel.cluster.cluster.1716545058-rpjw:cluster.py:721 engine set stopped 1716545059: {'exit_code': 1, 'pid': 63936, 'identifier': 'ipengine-1716545058-rpjw-1716545059-59766'}
@minrk, do you have any idea? (Being the ipyparallel wizard!)
The bug reporter also notes that lscpu shows:
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
So if I'm reading the error message right, openmpi is complaining because it's been asked to run 2 processes but thinks it only has 1 core (and it's ignoring the available hwthreads).
I think we could allow for that in the debian build scripts by setting OMPI_MCA_rmaps_base_oversubscribe=true, which might be the simplest resolution.
yeah, allowing oversubscribe should be the fix here. We have to set a bunch of env vars to get openmpi to run tests reliably on CI because it's very strict and makes a lot of assumptions by default. oversubscribe is probably the main one for real user machines.
You could probably set the oversubscribe env var in your conftest to make sure folks don't run into this one.
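Something along these lines at the top of tests/conftest.py would do it (a minimal sketch; OMPI_MCA_rmaps_base_oversubscribe is the Open MPI 4.x parameter name confirmed above, and Open MPI 5 reads differently named PRTE_-prefixed variables instead):
import os

# Sketch: allow Open MPI to oversubscribe cores before ipyparallel launches its
# MPI engines; this must run before any Cluster fixture starts.
os.environ.setdefault("OMPI_MCA_rmaps_base_oversubscribe", "true")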
Our bug reporter confirms OMPI_MCA_rmaps_base_oversubscribe=true resolves the issue in the debian tests. I've now added it to the debian scripts.
Hello. Original reporter here. Enabling oversubscription worked for a while, but now with OpenMPI 5 (using the new environment variable) there is some test in test_original_checkpoint.py which makes the build machine get stuck.
I've documented this problem in the salsa commit where I've disabled those tests:
https://salsa.debian.org/science-team/fenics/adios4dolfinx/-/commit/65a294f173a94aabe314274cccbdf0cfe15bb3bb
I'm using single-CPU virtual machines from AWS for this (mainly of types m7a.medium and r7a.medium) but I'd bet that this is easily reproducible by setting GRUB_CMDLINE_LINUX="nr_cpus=1".
Edit: I forgot to say that this is for version 0.8.1. We (Debian) already have preliminary releases of 0.9.0 in experimental, so I will test again when it's present in unstable.
Hello. Version 0.9.0 is now in Debian unstable, and we still have to disable test_original_checkpoint on single-CPU systems, because otherwise the machine hangs (as if it entered an endless loop).
Is this really supposed to happen?
As I am not a developer of IPython Parallel or openmpi, it is hard for me to do much about how they work together.
The library has certain tests that must be executed in parallel, as they check functionality specific to parallel computing. If there is a nice way of checking the number of available processors on a system in Python, I could add a pytest skip conditional.
@sanvila does oversubscription fail on a single-CPU system even with openmpi's new PRTE_ environment variables (or their command-line option equivalents)? The old OMPI_MCA_rmaps_base_oversubscribe=true can be expected to do nothing now with OpenMPI 5.
@drew-parsons Yes, it fails again, even after I put the new variables in place. Maybe this is a different issue than before and we should open a new one, but the problem is still the same (does not work ok on single-cpu systems) so for simplicity I decided to report it here as well.
@jorgensd This usually works and it's simple enough:
import os
[...]
@pytest.mark.skipif(os.cpu_count() == 1, reason="not expected to work on single-CPU machines")
I can add this tomorrow
Is it worth using xfail rather than skipif, to monitor if the cluster subsystem becomes robust enough to pass the test in the future?
If it is currently hanging, xfail wouldn’t be sufficient.
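For illustration (with a hypothetical test name), xfail by itself still executes the test body, so a hanging test would still block the session; you would need run=False, which at that point gives no more information than skipif:
import os
import pytest

# xfail still runs the test unless run=False is given, so a hang would still
# block the test session; run=False avoids executing the test at all.
@pytest.mark.xfail(os.cpu_count() == 1, reason="hangs on single-CPU machines", run=False)
def test_parallel_checkpoint():  # hypothetical placeholder for the hanging tests
    ...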
@sanvila do you mind testing https://github.com/jorgensd/adios4dolfinx/pull/140 to see if it resolves the issue for you, or if I have to intercept the number of CPUs earlier?
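For reference, the guard is roughly along these lines (a sketch of the idea only; the actual change in the PR may differ):
import os

import ipyparallel as ipp
import pytest


@pytest.fixture(scope="module")
def cluster():
    # Skip the MPI-parallel tests outright on single-CPU machines instead of
    # letting Open MPI refuse to launch (or hang on) the second engine.
    if os.cpu_count() == 1:
        pytest.skip("MPI-parallel tests are not expected to work on single-CPU machines")
    cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
    rc = cluster.start_and_connect_sync()
    yield rc
    cluster.stop_cluster_sync()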
@jorgensd Yes, it works. Thanks a lot.
tests/test_numpy_vectorization.py ...................................... [ 77%]
................................................... [ 82%]
tests/test_original_checkpoint.py ssssssssssssssssssssssssssssssssssssss [ 85%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 91%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 97%]
tests/test_snapshot_checkpoint.py ............................ [ 99%]
tests/test_version.py . [100%]
(The Debian package may still need some fine-tuning for Python 3.13, but I can see that the tests that previously hung are now skipped.)
Resolved in v0.9.1