ompi C11 atomics: replace _Atomic with volatile

Using the _Atomic type modifier leads to sequentially consistent atomic access on variables even in code paths where atomicity and ordering is not needed (opal_thread_* and initializations as in opal_object_t for example). By replacing _Atomic with volatile in the "public" interface of the C11 backend, we preserve the ability to suppress certain compiler optimizations while allowing for non-atomic accesses where not needed. All atomic accesses and ordering is done through the use of the opal_atomic* API throughout the code base so _Atomic is not needed.

My first attempt was to introduce non-atomic load/store macros in https://github.com/open-mpi/ompi/pull/10487 but the problem is pervasive and tracking down all instances of non-atomic accesses to atomic variables is tedious and likely going to miss places. I pointed out the general problem in https://github.com/open-mpi/ompi/issues/9722#issuecomment-1159557291.

The difference in performance is significant: a sequence of local irecv - send - wait using ob1 and btl/self shows 0.22us with current main and 0.16us with this patch for empty messages.

AFAICS, C11 _Atomic is designed for purely atomic accesses and not suitable for a large-scale mixed-use (performance-sensitive) model such as Open MPI.

Signed-off-by: Joseph Schuchart [email protected]

Jun 21 '22 15:06 devreal

Here is an example of how this issue manifests: in mca_pml_ob1_recv_req_start these three assignments:

    /* init/re-init the request */
    req->req_lock = 0;
    req->req_pipeline_depth = 0;
    req->req_bytes_received = 0;

translate to three consecutive xchg instructions on x86-64:

Those do not need to be atomic (the xchg instruction is atomic even without the lock prefix). With this patch, the assembly code does not contain atomic operations:

I get the same results with GCC 11 and Clang 15.

Jun 21 '22 16:06 devreal

@gpaulsen you can do something like the following to get the assembly code for the function above: objdump -d libmpi.so | awk -v RS= '/^[[:xdigit:]]+ mca_pml_ob1_recv_req_start/'

Jun 21 '22 16:06 bosilca

bot:ompi:retest

Jun 22 '22 14:06 jsquyres

I put together some code examples on godbolt. There are two variants: one using _Atomic and one using volatile. One function, fadd_[atomic|volatile] does what the opal wrappers do: check a global variable for thread support and fall back to non-atomic operations if threads are disabled. The second function, set[a|v] non-atomically sets a member in a structure, something typically done during structure initialization.

In both cases, using the _Atomic prefix results in atomicity in the non-atomic code paths.

Jul 13 '22 15:07 devreal

@bwbarrett Could you review?

Jul 17 '22 11:07 jsquyres

I agree that this is a band-aid but I didn't feel comfortable proposing a complete overhaul. If we want to go down that road we might want to look at other places: the Linux kernel for example uses a custom struct around an integer [1] to prevent direct load/store access and provides macros for initialization and atomic / non-atomic access [2]. Interestingly, the kernel is still using volatile and not _Atomic, despite articles a few years back talking about possibly using C11 atomics [3]. There is an overview of the atomics infrastructure in [4].

By using a custom structure around an _Atomic variable and encapsulating all accesses in macros/inline functions we can use atomic_store_explicit with memory_order_relaxed for non-atomic loads and stores. It would require touching a lot of code though. Otherwise volatile is the only way I see to prevent compiler optimizations that would break existing code and prevent atomic operations in places where they are not needed. Maybe someone has other ideas?

[1] https://github.com/torvalds/linux/blob/master/include/linux/types.h#L166 [2] https://github.com/torvalds/linux/blob/master/include/asm-generic/rwonce.h#L44 [3] https://lwn.net/Articles/691128/ [4] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r5.html

Jul 18 '22 14:07 devreal

MPICH seems to be using a similar approach: https://github.com/pmodels/mpich/blob/main/src/mpl/include/mpl_atomic_c11.h (from my limited understanding of their code base)

Jul 18 '22 15:07 devreal

@devreal my preference would be to take the approach similar to the Linux kernel and MPICH. It makes sense to me and should keep us out of trouble long term.

I'm not sure how the 5.0 RMs would feel about that change (given that it's going to have impact across the code base). @janjust, @awlauria any opinions?

If the 5.0 RMs don't want a big patch, we should discuss the tradeoff between this patch and preferring GCC-style builtins over C11 (which I think will avoid the performance problem, right?).

Jul 19 '22 13:07 bwbarrett

Yes, there will be a bit of code to touch. Using GCC __sync builtins would help somewhat but has it's drawbacks. They don't provide much control over the memory ordering, which would matter for atomic locks on x86 and certainly has an impact on arm or IBM. We do have a backend using GCC __atomic builtins already, which seem to provide the same functionality as the C11 atomic functions, but without the cast. I guess we would just prioritize them over the C11 atomics then for 5.0. I'm just not sure whether all compilers support them (Clang probably does).

Jul 19 '22 14:07 devreal

On ppc64le, objdump of mca_pml_ob1.so gave me:

0000000000005220 <00000018.plt_call.mca_pml_ob1_recv_req_start>:
    5220:       18 00 41 f8     std     r2,24(r1)
    5224:       90 88 82 e9     ld      r12,-30576(r2)
    5228:       a6 03 89 7d     mtctr   r12
    522c:       20 04 80 4e     bctr
        ...

And then I've uploaded the full disassembly to: https://gist.github.com/gpaulsen/8d96a9e9dc8c17e803a56e3da7c122ac

Now to try to parse that. :)

Aug 15 '22 13:08 gpaulsen

Closing in favor of https://github.com/open-mpi/ompi/pull/10613

Aug 16 '22 15:08 devreal

ompi ompi copied to clipboard

C11 atomics: replace _Atomic with volatile

ompi
ompi copied to clipboard