ompi Add simple opal atomic store/load?

Hi, I'm looking through OPAL's atomic utilities (opal/include/opal/sys/atomic.h), and I'm noticing that while there's a large collection of atomic operations, there exist no simple store/load functions (eg. opal_atomic_store or opal_atomic_load).

Would it be a good idea / is it possible to add these operations?

I can see some alternative solutions, like using another operation (eg. opal_atomic_add_fetch -- but what about performance?), or rolling my own version (gcc/asm), but it definitely sounds nice if the solution was integrated in opal's atomic library (?).

Dec 03 '21 09:12 gkatev

I think the assumption is that single-elements loads and stores are atomic, yet not synchronizing. OPAL provides opal_atomic_rmb and opal_atomic_wmb to order loads and stores, respectively. Would that work for your use-case?

Dec 03 '21 13:12 devreal

I'm thinking of cases where one process is constantly updating (think sequence number style) a value and another is reading it. (not the case where I want to ensure ordering between two variables' acceses, where the barriers are useful). If the load/stores are not atomic, a partial value could theoretically be received.

I think writing 64-bit values on arm, or writing to non-64-bit-aligned ones on x86 is not atomic.

Dec 03 '21 13:12 gkatev

According to https://developer.arm.com/documentation/ddi0487/gb/, B2.2.1 64bit load/stores are atomic on 64-bit arm. Unaligned 64bit load/store is UB in C so I doubt the compiler will come to the rescue here (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 74, 7)

Having said that, I am not opposed to introducing atomic load/store to opal. But what would be the memory ordering? Sequential consistency likely will perform worse than opal_atomic_wmb + store on some architectures. We could have acquire/release semantic for atomic load/store respectively, well documented of course, and replace some of the memory barriers. I'd definitely be interested in examples where that would be feasible in OMPI and in seeing the potential performance impact.

Dec 03 '21 13:12 devreal

I'd definitely say relaxed ordering. My specific use-case is about the closest thing I can get to ordinary C writes/reads, without being prone to errors.

A variant with other-than-relaxed ordering could also be created, as is also bound to be useful somewhere.

Edit: Another point where simple writes/reads may be problematic are 64-bit writes on 32-bit x86 (?). To be honest, for my case I only need atomic loads/stores for size_t, so it's possible (as it turns out!) that for the major architectures I'm absolutely fine without explicit atomics. In any way, it would be nice for future insterested parties if these atomics where implemented, even if just via a simple C assignment (or __atomic_store in the gcc version), so that looking up information for each architecture will not be necesarry.

Dec 03 '21 13:12 gkatev

I think we'd be open to a pull request, but I'm dubious they're actually useful. We still support compilers with limited atomic support and there's not a great way to express what you're asking for without compiler support. Yes, you could provide wrappers which are essentially just load and store, and rely on the behaviors we rely on today, but I'd almost rather we make the developer think about it, rather than letting the developer assume the compiler will do things that it won't do (like enforce all the atomicity rules).

Dec 03 '21 16:12 bwbarrett

Given the new (to me!) revelations that the compiler automatically naturally aligns types (?) (since it would be UB if it didn't), which did not occur to me at the time, and since more operations than I thought are "tear-free" (eg. 64-bit ops on arm) -- though I don't have a complete image for archs beyond x86 and arm --, I wouldn't be as hard-pressed to add these.

However, even if just one case exists (64-bit op on 32-bit x86?) where simple assignments do not cut it, the ideally portable code would optimally use explicit atomic operations. I agree that the developer should be aware of what happens behind the scenes, but from an atomics library viewpoint, the role of which is (?) to take some of this burden of the developer, these basic ops seem like a basic component. Of course the additions would also serve as documentation for future eyes.

In any case, it looks to me like they are a good fit and are non-intrusive. For example, there exist a bunch of similar opal_atomic_fetch_op_xx methods, all of which also have relaxed ordering. The implementation (compiler support version) should be simple enough:

static inline int32_t opal_atomic_load_32(opal_atomic_int32_t *addr)
{
    return __atomic_load_n(addr, __ATOMIC_RELAXED);
}
static inline void opal_atomic_store_32(opal_atomic_int32_t *addr, int32_t value)
{
    __atomic_store_n(addr, value, __ATOMIC_RELAXED);
}

Apart from _32 and _64, a _size_t variant could also be added similarly to other ops. Perhpas even an _int one (!), though that one is probably minimally necesary and would stand out like an odd duck :-).

If you also agree, I can go ahead and open a PR and take it from there.

Dec 07 '21 10:12 gkatev

Just to note. Open MPI no longer supports 32-bit platforms (x86, ppc32, arm32, etc). Also keep in mind an atomic load can be done as fetch add with a 0 operand and store with atomic swap.

Dec 07 '21 14:12 hjelmn

Also keep in mind that Open MPI intends to move to using C11 atomics exclusively at some point in the future. Most of the custom atomic code will go away at that point.

Dec 07 '21 14:12 hjelmn

I am not opposed to having simple atomic load/stores in opal. However, I don't think there is a use-case for it right now because inter-thread synchronization is synchronized through locks and/or fences, so it would essentially be dead code.

Dec 07 '21 15:12 devreal

@gkatev If you need such infrastructure you can of course add it to your local branch. If that ever gets upstreamed we will have a use-case for atomic load/store and will include the them. I am also planning to revisit our current use of fences and atomics, if I come across a possible use-case I would sync up with you.

Dec 07 '21 15:12 devreal

Yes this is all reasonable, and I've already taken care of my local code, so all is well for me. Since it looks like an overhaul will be taking place at some point I do suggest their future addition! (use case: single-writer-based synchronization, in an numerically incremental manner, without extra ordering flags)

(not sure if we/you want the issue closed -- feel free to do it)

Dec 07 '21 15:12 gkatev

Turns out, all increment/decrement/assignment accesses to an _Atomic variable are atomic with sequential consistency [1]. This nullifies all the efforts to avoid atomic operations if not needed (in the opal_thread_* functions, for example) and adds atomic operations where you wouldn't expect them (OBJ_CONSTRUCT for example, and other places where values are assigned to atomic variables). See the below screenshot of a perf profile of opal_thread_add_fetch_32 where both branches (the atomic on top and the supposedly non-atomic branch at the bottom) in fact use atomic operations on x86 (compiled with GCC 11.2.0):

I am working on a patch to introduce OPAL_ATOMIC_RELAXED_LOAD and OPAL_ATOMIC_RELAXED_STORE that map to atomic_load_explicit and atomic_store_explicit with relaxed ordering for C11 atomics and regular assignment for all others implementations. It's a mess though, because finding all instances where atomic variables are assigned and no atomicity is required is tedious...

@gkatev Was that motivating your initial question about loads and stores? This struck me while doing some profiling and I didn't realize that was an issue earlier.

[1] https://en.cppreference.com/w/c/language/atomic

Jun 18 '22 20:06 devreal

So, you refer to *addr = *addr operator delta (in OPAL_THREAD_DEFINE_ATOMIC_OP), which when not using threads is meant to be non-atomic (?)

Are there any opal atomics implementations that rely on the implicit ordering of _Atomic? Would it harm not including it in the typedefs of opal_atomic_xxx_t? (or perhaps alternatively not having it when not using threads??)

My initial request would indeed be solved by functions that call atomic_load/store_explicit with relaxed ordering! You mean if my motivation was for opal_atomic_store/load with non-seqcst-implied ordering? From what I remember there where (are?) no opal atomic functions that only store or load (but e.g. there are add_fetch).

My motivation here for the existence of opal_atomic_store/load with relaxed ordering was portability. Because while in most architectures a non-atomic write/read would be equivalent, maybe on some bizarre architecture it won't be (??). Essentially, the same way that one uses __atomic_store_n(&dst, val, __ATOMIC_RELAXED) instead of dst = val, just to be safe, I also looked to use opal_atomic_store(&dst, val).

Jun 18 '22 21:06 gkatev

So, you refer to *addr = *addr operator delta (in OPAL_THREAD_DEFINE_ATOMIC_OP), which when not using threads is meant to be non-atomic (?)

That's right, it's supposed to be non-atomic addition and assignment.

Are there any opal atomics implementations that rely on the implicit ordering of _Atomic? Would it harm not including it in the typedefs of opal_atomic_xxx_t? (or perhaps alternatively not having it when not using threads??)

IIRC, the reason we included _Atomic is so that patterns like the one below are not optimized out:

while (atomic_var == 0) { ... }

Not sure where such patterns are used. Alternatively, using volatile instead (as other atomic backends do) would achieve the same effect in the example above but the canonical way in C11 is _Atomic (and it safes us from casting around).

My understanding is that throughout the code base we rely on explicit memory barriers (opal_atomic_rmb() and opal_atomic_wmb()) to ensure ordering between operations. So no code relies on any ordering guarantees just from loads and stores on _Atomic operations.

IMHO, C11 _Atomic is a mistake and the issue here is just one example. C++20 has finally introduce std::atomic_ref to allow atomic access to non-atomic variables, which really should be the norm. I will post my PR soon and leave it to the community to decide how to proceed. For the record, the difference for a local irecv - send - wait is significant: 0.22us with the superfluous atomic operations vs 0.16us with relaxed load/stores.

Jun 19 '22 20:06 devreal

ompi ompi copied to clipboard

Add simple opal atomic store/load?

ompi
ompi copied to clipboard