16 byte cmpxchg
I think there are two ways to improve the current cmpxchg support for 16-byte data structures:
- consider targeting Arm LSE (which provides `casp`, a 16-byte CAS instruction);
- consider using `__sync_bool_compare_and_swap` to force GCC to emit inlined instructions.

For example:
```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Older GCC doesn't define __has_builtin; treat it as "no builtin available".
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

// Enable the 16-byte CAS instruction (Arm LSE CASP / x86 CMPXCHG16B) for just
// the functions below, then restore the previous target options.
#if defined(__aarch64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("lse"))), apply_to = function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__aarch64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("arch=armv8-a+lse")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#elif defined(__x86_64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("cx16"))), apply_to = function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__x86_64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("cx16")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#else
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING
#endif

template<class T>
__attribute__((always_inline)) inline bool cas(
  std::atomic<T>& src, T const& __restrict cmp, T const& __restrict with)
{
  // Copy the operands into __int128 temporaries without risking a memcpy libcall.
  auto inline_copy = [](__int128* dst, const void* __restrict src) {
#if __has_builtin(__builtin_inline_memcpy)
    __builtin_inline_memcpy(dst, src, sizeof(__int128));
#elif __has_builtin(__builtin_memcpy)
    __builtin_memcpy(dst, src, sizeof(__int128));
#else
    ::memcpy(dst, src, sizeof(__int128));
#endif
  };
  __int128 cmp_value;
  __int128 with_value;
  inline_copy(&cmp_value, &cmp);
  inline_copy(&with_value, &with);
  // The __sync builtin is expanded inline when the target feature is available.
  return __sync_bool_compare_and_swap(
    reinterpret_cast<__int128*>(&src), cmp_value, with_value);
}

struct A
{
  int64_t a, b;
};

bool cas_test(std::atomic<__int128>& src, __int128 const& cmp, __int128 const& with)
{
  return cas(src, cmp, with);
}

bool cas_test(std::atomic<A>& src, A const& cmp, A const& with)
{
  return cas(src, cmp, with);
}

PLATFORM_SPECIFIC_OPTIONS_ENDING

// For comparison: the same CAS via std::atomic, which may be lowered to a
// libatomic call instead of a single instruction.
bool cas_test2(std::atomic<__int128>& src, __int128& cmp, __int128& with)
{
  return src.compare_exchange_weak(cmp, with);
}
```
I'm not sure why we'd need anything on Arm. Arm has load-exclusive/store-exclusive, so it doesn't need a second word for ABA protection.
The only platforms where we have a problem have the following combination of things:
- An architecture with a CAS instruction but no LL/SC, so no ABA-proof primitive (i.e. x86; this is not a problem on any other architecture).
- A standard library that uses a compiler builtin for 128-bit CAS.
- A compiler that doesn't expand that builtin to a single instruction.
This means that you need the combination of x86, libstdc++, and gcc to see the problem. GCC has an open bug report about it and so, presumably, will fix it eventually.
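
A quick way to check which case you're hitting (an illustrative sketch, not part of snmalloc) is to compile a tiny 16-byte CAS and inspect the generated code: with an inlined primitive you should see `lock cmpxchg16b`, otherwise a call into libatomic (e.g. `__atomic_compare_exchange_16`).

```cpp
#include <atomic>
#include <cstdint>

// Illustrative probe: build with `g++ -O2 -mcx16 -S` (or look at it on
// Compiler Explorer) and check whether the CAS is inlined or outlined.
struct Pair
{
  uint64_t ptr;
  uint64_t aba; // second word used purely for ABA protection
};

bool probe(std::atomic<Pair>& target, Pair& expected, Pair desired)
{
  return target.compare_exchange_strong(expected, desired);
}
```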
That said, the ABA code isn't used on any hot paths in snmalloc, so the cost of a libcall is unlikely to be noticeable.
This combination of things (a bug in a single compiler that doesn't affect hot paths) means that I'm very hesitant to do something ad-hoc as a workaround for that compiler.
Looking at the code, we seem not to be doing LL/SC for non-x86 architectures. I thought @nwf added that a while ago?
I had a look at LL/SC on Arm, but couldn't work out how to do it. The ABA primitive is designed to be able to use it; I just didn't know a good way to actually achieve it without complex inline assembly across functions.
The use of the ABA protection is only for allocating allocators, and potentially in the backend. So falling back to locks for ABA protection is actually fine in most cases.
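
For reference, a minimal sketch of what the lock fallback could look like (this is not snmalloc's actual code; the names and layout are invented for illustration): guard the pointer/epoch pair with a small spinlock and bump the epoch on every successful update, so a recycled pointer can't be mistaken for the old one.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical lock-based ABA protection: the pointer and an epoch counter
// are updated together under a tiny spinlock instead of a 16-byte CAS.
struct LockedABA
{
  std::atomic_flag lock = ATOMIC_FLAG_INIT;
  void* ptr = nullptr;
  uint64_t epoch = 0;

  // Installs `desired` only if both the pointer and the epoch still match
  // what the caller previously read; returns whether it succeeded.
  bool compare_exchange(void* expected_ptr, uint64_t expected_epoch, void* desired)
  {
    while (lock.test_and_set(std::memory_order_acquire))
    {
      // spin
    }
    bool ok = (ptr == expected_ptr) && (epoch == expected_epoch);
    if (ok)
    {
      ptr = desired;
      epoch++; // a recycled pointer can't be confused with the old one
    }
    lock.clear(std::memory_order_release);
    return ok;
  }
};
```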
Hi,
I opened this issue because I found that LLVM and GCC behave quite differently on whether to inline these atomic ops, which can add an extra link dependency on libatomic.so for Rust.
For example, on x86 we already require the target to be compiled with -mcx16, but GCC will still emit out-of-line atomic calls.
For double-word LL/SC, maybe https://github.com/taiki-e/portable-atomic/blob/main/src/imp/aarch64.rs could be used as a reference?
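
For what it's worth, here is a rough sketch (assumptions: AArch64, GCC/Clang extended inline asm, acquire/release ordering, no LSE dispatch) of a 128-bit CAS built from LDAXP/STLXP, in the spirit of the portable-atomic code linked above. Keeping the whole LL/SC loop in one asm block avoids the problem mentioned earlier of the compiler touching memory between separate statements and clearing the exclusive monitor.

```cpp
#include <cstdint>

#if defined(__aarch64__)
// Hypothetical helper, not snmalloc code: 128-bit CAS via a load/store-exclusive
// pair. Returns true if *addr equalled `expected` and was replaced by `desired`.
inline bool cas128_llsc(unsigned __int128* addr,
                        unsigned __int128 expected,
                        unsigned __int128 desired)
{
  uint64_t exp_lo = static_cast<uint64_t>(expected);
  uint64_t exp_hi = static_cast<uint64_t>(expected >> 64);
  uint64_t des_lo = static_cast<uint64_t>(desired);
  uint64_t des_hi = static_cast<uint64_t>(desired >> 64);
  uint64_t old_lo, old_hi;
  uint32_t status;
  __asm__ volatile(
    "1: ldaxp %[ol], %[oh], [%[p]]        \n" // load-acquire exclusive pair
    "   cmp   %[ol], %[el]                \n"
    "   ccmp  %[oh], %[eh], #0, eq        \n"
    "   b.ne  2f                          \n" // observed value differs: give up
    "   stlxp %w[st], %[dl], %[dh], [%[p]]\n" // store-release exclusive pair
    "   cbnz  %w[st], 1b                  \n" // lost the exclusive: retry
    "2:                                   \n"
    : [ol] "=&r"(old_lo), [oh] "=&r"(old_hi), [st] "=&r"(status)
    : [p] "r"(addr), [el] "r"(exp_lo), [eh] "r"(exp_hi),
      [dl] "r"(des_lo), [dh] "r"(des_hi)
    : "cc", "memory");
  return (old_lo == exp_lo) && (old_hi == exp_hi);
}
#endif
```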