`libfabric=1.17.0-3` on Debian causes MPI tests to fail with `MPI_ERR_OTHER`
Sample CI failure: https://gitlab.tiker.net/inducer/meshmode/-/jobs/533461
Similar failure in grudge: https://gitlab.tiker.net/inducer/grudge/-/jobs/533485
Sample traceback:

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 208, in <module>
    main()
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 609, in <module>
    _test_mpi_boundary_swap(dim, order, num_groups)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 426, in _test_mpi_boundary_swap
    conns = bdry_setup_helper.complete_some()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/meshmode/distributed.py", line 332, in complete_some
    data = [self._internal_mpi_comm.recv(status=status)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mpi4py/MPI/Comm.pyx", line 1438, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 341, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
```
Downgrading libfabric (see here) appears to resolve this.
This is the code in mpi4py that ultimately fails; it's a matched receive (mrecv).
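For illustration, here is a minimal standalone sketch of the matched-probe receive pattern at the user level. This is not mpi4py's internals or the meshmode code, just the same MPI_Mprobe/MPI_Mrecv sequence that `comm.recv` performs under the hood when matched probes are enabled (file name and payload are arbitrary):

```python
# Minimal sketch of a matched-probe receive in mpi4py.
# Run with e.g.: mpiexec -n 2 python mprobe_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.Get_rank() == 0:
    comm.send({"payload": 42}, dest=1, tag=7)
elif comm.Get_rank() == 1:
    status = MPI.Status()
    # mprobe blocks until a message matches and returns a Message handle
    # (MPI_Mprobe); Message.recv then consumes exactly that message
    # (MPI_Mrecv). The MPI_ERR_OTHER in the traceback above is raised
    # from this receive step.
    msg = comm.mprobe(status=status)
    data = msg.recv()
    print(f"rank 1 received {data} from rank {status.Get_source()}")
```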
@majosm Got any ideas? (Pinging you since the two of us last touched this code.)
Maybe this could serve as a workaround: in mirgecom, we disable mpi4py's mprobe due to a similar crash observed with Spectrum MPI (https://github.com/illinois-ceesd/mirgecom/issues/132):
https://github.com/illinois-ceesd/mirgecom/blob/babc6d2b9859719a3ba4a45dc91a6915583f175d/mirgecom/mpi.py#L183-L186
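In case it's useful to anyone trying this, a minimal sketch of that workaround. The `rc` option has to be set before `mpi4py.MPI` is first imported anywhere in the process, which (as I understand it) is why mirgecom does it in its own wrapper:

```python
# Sketch of the suggested workaround: tell mpi4py not to use matched
# probes (MPI_Mprobe/MPI_Mrecv) for its pickle-based receives. This
# must run before mpi4py.MPI is imported anywhere in the process.
import mpi4py
mpi4py.rc.recv_mprobe = False

from mpi4py import MPI  # noqa: E402
```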
Thanks for the tip! Unfortunately, setting `recv_mprobe = False` does not avoid this particular issue.
Exciting news: while I don't know exactly what the issue is, OpenMPI 4.1.5-1 seems to include a fix that makes it work properly with the previously-offending version of libfabric1.
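For anyone checking whether their environment actually picked up the fixed OpenMPI build, mpi4py can report the linked MPI library at runtime:

```python
# Print the MPI implementation banner (e.g. "Open MPI v4.1.5, ...") to
# confirm which MPI library mpi4py is linked against.
from mpi4py import MPI
print(MPI.Get_library_version())
```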