
opal_lifo test fails on FreeBSD amd64

Open LaurentChardon opened this issue 8 months ago • 6 comments

The OMPI 5.0.7 test suite fails at the opal_lifo test on amd64 platforms running FreeBSD. This is true for all currently supported versions of FreeBSD and for every OMPI 5.x release that I have tested. I haven't tried version 4, but I can if it's useful.

For FreeBSD 14.2 on aarch64 with clang 18.1.6, all tests pass except for a few that are skipped. opal_lifo is not skipped; it passes.

For FreeBSD 14.2 on amd64 with clang 18.1.6, opal_lifo fails:

❯ cat work/openmpi-5.0.7/test/class/test-suite.log
===============================================
   Open MPI 5.0.7: test/class/test-suite.log
===============================================

# TOTAL: 10
# PASS:  9
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: opal_lifo
===============

 Failure :  lifo push/pop multi-threaded with atomics
 Failure :  list pop all items
SUPPORT: OMPI Test failed: opal_lifo_t (2 of 7 failed)
Single thread test. Time: 0 s 2883 us 2 nsec/poppush
Atomics thread finished. Time: 0 s 27179 us 27 nsec/poppush
Atomics thread finished. Time: 0 s 9381 us 9 nsec/poppush
Atomics thread finished. Time: 0 s 9839 us 9 nsec/poppush
Atomics thread finished. Time: 0 s 10220 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 10721 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 10926 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 18153 us 18 nsec/poppush
Atomics thread finished. Time: 0 s 21205 us 21 nsec/poppush
Atomics thread finished. Time: 0 s 22446 us 22 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 22504 us 2 nsec/poppush
FAIL opal_lifo (exit status: 1)

The issue is not unique to this version of the compiler. I have the same failure with FreeBSD 15.0 on amd64 and clang 19.1.5, for example.

This issue may be related to https://github.com/open-mpi/ompi/issues/10988

LaurentChardon, Mar 11 '25 12:03

According to Godbolt, clang 18 through 20 does not support 16-byte atomic operations on x86_64 without the -mcx16 flag. With the flag, however, the generated code is very similar to the gcc code, which works (judging by the absence of pending issues on a major platform).

We need to confirm what the OMPI configure script detected, and which implementation of the 16-byte atomic operations it selected. This information is in config.log.
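
As a rough illustration of the -mcx16 point above (this is not taken from the OMPI source, and the function names `have_inline_cas16` and `try_cas16` are made up for the example), the sketch below checks the predefined macro `__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16`, which gcc and clang define when they can emit a 16-byte compare-and-swap inline. It only attempts the 16-byte CAS when that macro is present, so it should compile either way.

```c
/* Returns 1 if the compiler advertises an inline 16-byte compare-and-swap
 * for this build (for example x86_64 compiled with -mcx16), otherwise 0.
 * Hypothetical helper name, used only for this illustration. */
static int have_inline_cas16(void)
{
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
    return 1;
#else
    return 0;
#endif
}

/* Exercises a 16-byte CAS on a local __int128.
 * Returns  1 if the CAS behaved as expected,
 *          0 if it misbehaved,
 *         -1 if no inline 16-byte CAS is available in this build. */
static int try_cas16(void)
{
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
    __int128 value = 1;

    /* Expected value matches (1), so the swap should succeed: value -> 2. */
    int first = __sync_bool_compare_and_swap(&value, (__int128)1, (__int128)2);

    /* Expected value no longer matches (value is 2), so this should fail. */
    int second = __sync_bool_compare_and_swap(&value, (__int128)1, (__int128)3);

    return first && !second && value == 2;
#else
    return -1;
#endif
}
```

On x86_64, building a small program around these helpers once with and once without -mcx16 should show whether the compiler considers the 16-byte CAS available; other options that enable the same CPU feature (such as a suitable -march) may have the same effect.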

bosilca, Mar 11 '25 14:03

@bosilca you nailed it. Adding the -mcx16 flag to CFLAGS fixed the issue. Thank you very much!

LaurentChardon, Mar 11 '25 14:03

@bosilca Good catch. Do we need to add a test into configure?

jsquyres, Mar 11 '25 14:03

Does that mean the non-16B lifo is broken?

devreal, Mar 11 '25 15:03

That's kind of good: we have a solution. But it's also bad because 1) we already have that test, but it is apparently not picking the pieces correctly, 2) the non-16B part of the code seems broken, and 3) all hell breaks loose, as we have had a broken piece of code for years.

bosilca, Mar 11 '25 15:03

This is the potentially related issue that I mentioned on the call: https://github.com/open-mpi/ompi/issues/12979

edgargabriel, Mar 11 '25 15:03