Failing IMPI sanity check due to UCX_TLS
On a Rocky Linux 8.7 system with Intel Cascade Lake CPUs, I get a failing sanity check for impi/2021.10.0:
```
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1548)..............:
MPIDI_OFI_mpi_init_hook(1662):
get_ep_names(419)............: OFI get endpoint name failed (ofi_init.c:419:get_ep_names:Invalid argument)
```
I traced this down to $UCX_TLS=all being set: if I unset (only) this variable, the check passes.
The same failure occurs when running (at least) this hello-world program manually with that module loaded (forcing the sanity check to be skipped for testing); see the sketch below.
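For reference, a minimal reproduction sketch (assuming the impi module is loaded and `./mpi_test` is a compiled MPI hello-world program; names and paths are illustrative, not the actual sanity-check internals):

```python
import os
import subprocess

# Run the MPI hello world with UCX_TLS=all (as set by the module) and with
# the variable removed from the environment; only the former fails here.
env_with = dict(os.environ, UCX_TLS='all')
env_without = {k: v for k, v in os.environ.items() if k != 'UCX_TLS'}

for label, env in [('UCX_TLS=all', env_with), ('UCX_TLS unset', env_without)]:
    result = subprocess.run(['mpirun', '-n', '2', './mpi_test'], env=env)
    print(f'{label}: exit code {result.returncode}')
```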
This is in contrast to the comment in the easyblock: https://github.com/easybuilders/easybuild-easyblocks/blob/064e3e24b4638ed2e20d8ecd3370c2ae459aee68/easybuild/easyblocks/i/impi.py#L387-L389
While there is a way to set it to a different value in the easyconfig (modextravars takes precedence), there is no way to NOT set it at all. Hence we might need to rethink that; see the example below.
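For illustration, overriding the value in an easyconfig could look like this (the transport list is a made-up example, not a recommendation):

```python
# modextravars takes precedence over the value the easyblock sets,
# but there is no corresponding way to leave $UCX_TLS unset entirely.
modextravars = {'UCX_TLS': 'rc,ud,sm,self'}  # example value only
```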
I'll do some more tests with the IMPI versions we have access to next week.
@Flamefire Any updates here?
If our assumption about UCX_TLS is wrong, would it be sufficient to add an easy way to use a different value, for example via a custom easyconfig parameter for the impi easyblock?
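Such a parameter might look roughly like this in the impi easyblock (a hypothetical sketch following the usual extra_options pattern; the `ucx_tls` parameter name and its semantics are my invention, not existing code):

```python
from easybuild.easyblocks.generic.intelbase import IntelBase
from easybuild.framework.easyconfig import CUSTOM


class EB_impi(IntelBase):
    """Sketch of an addition to the existing impi easyblock."""

    @staticmethod
    def extra_options():
        extra_vars = {
            # None could mean "do not set $UCX_TLS in the generated module at all"
            'ucx_tls': ['all', "Value to set $UCX_TLS to in the module (None to leave it unset)", CUSTOM],
        }
        return IntelBase.extra_options(extra_vars)
```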
I did a larger test run:
- The 7 impi-2018* ECs fail in the sanity check with an unrelated segfault.
- On a RHEL 8.9 system with "Intel(R) Xeon(R) Platinum 8470" or "AMD EPYC 7702" CPUs, I don't see the issue with any of the 14 impi ECs I tested on each.
- I only see it on a RHEL 8.7 system with "Intel(R) Xeon(R) Platinum 8276M" CPUs.
Of those 14 ECs, 13 succeed when I unset UCX_TLS in the sanity check step. The last one (impi-2019.9.304-iccifortcuda-2020b.eb) gives me another strange error:
```
$ mpirun -n 32 /dev/shm/easybuild-tmp/eb-l9s3j6ko/tmpjum3rdnw/mpi_test
ucp_worker.c:1835 UCX ERROR too many ep configurations: 16 (max: 16)
Abort(1091215) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1139)..............:
MPIDI_OFI_mpi_init_hook(1728): OFI get address vector map failed
```
That happens when using more than 16 processes (mpirun -n 16 works) and seems to be a known issue in UCX < 1.11.
I tried setting $UCX_TLS to each individual value from the list in https://ucx-py.readthedocs.io/en/latest/configuration.html#ucx-tls, but no individual value reproduced the issue; only "all" does.
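The per-value test looked roughly like this sketch (transport list abridged from the linked page; `./mpi_test` is the compiled hello world from above):

```python
import os
import subprocess

# Abridged transport values from the UCX_TLS list in the linked documentation
TLS_VALUES = ['all', 'tcp', 'rc', 'ud', 'dc', 'sm', 'self']

for tls in TLS_VALUES:
    result = subprocess.run(['mpirun', '-n', '2', './mpi_test'],
                            env=dict(os.environ, UCX_TLS=tls))
    print(f"UCX_TLS={tls}: {'OK' if result.returncode == 0 else 'FAIL'}")
```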