opal_lifo test fails on FreeBSD amd64
The OMPI 5.0.7 test suite fails at the opal_lifo test on amd64 platforms running FreeBSD. This holds for all currently supported versions of FreeBSD and for every 5.x release of OMPI that I have tested. I haven't tried version 4, but I can if it's useful.
For FreeBSD 14.2 on aarch64 with clang 18.1.6, all tests pass except for a few that are skipped. opal_lifo is not skipped; it passes.
For FreeBSD 14.2 on amd64 with clang 18.1.6, opal_lifo fails:
❯ cat work/openmpi-5.0.7/test/class/test-suite.log
===============================================
Open MPI 5.0.7: test/class/test-suite.log
===============================================
# TOTAL: 10
# PASS: 9
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0
.. contents:: :depth: 2
FAIL: opal_lifo
===============
Failure : lifo push/pop multi-threaded with atomics
Failure : list pop all items
SUPPORT: OMPI Test failed: opal_lifo_t (2 of 7 failed)
Single thread test. Time: 0 s 2883 us 2 nsec/poppush
Atomics thread finished. Time: 0 s 27179 us 27 nsec/poppush
Atomics thread finished. Time: 0 s 9381 us 9 nsec/poppush
Atomics thread finished. Time: 0 s 9839 us 9 nsec/poppush
Atomics thread finished. Time: 0 s 10220 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 10721 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 10926 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 18153 us 18 nsec/poppush
Atomics thread finished. Time: 0 s 21205 us 21 nsec/poppush
Atomics thread finished. Time: 0 s 22446 us 22 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 22504 us 2 nsec/poppush
FAIL opal_lifo (exit status: 1)
The issue is not unique to this version of the compiler. I have the same failure with FreeBSD 15.0 on amd64 and clang 19.1.5, for example.
This issue may be related to https://github.com/open-mpi/ompi/issues/10988
According to Godbolt, clang 18 through 20 does not support 16-byte atomic operations on x86_64 without the -mcx16 flag. With that flag, however, the generated code is very similar to the code gcc generates, which works (judging by the absence of pending issues on a major platform).
We need to confirm what the OMPI configure script detected and which implementation of the 16-byte atomic operations it selected. This information is in config.log.
@bosilca you nailed it. Adding the -mcx16 flag to CFLAGS fixed the issue. Thank you very much!
@bosilca Good catch. Do we need to add a test into configure?
Does that mean the non-16B lifo is broken?
That's kind of good: we have a solution. But it's also bad, because 1) we already have that test in configure but it apparently isn't picking the pieces correctly, 2) the non-16B part of the code seems broken, and 3) all hell breaks loose, since we have had a broken piece of code for years.
This is the potentially related issue that I mentioned on the call: https://github.com/open-mpi/ompi/issues/12979