16 byte cmpxchg
I think there are two ways to improve the current cmpxchg support for 16-byte data structures:
- consider targeting Arm LSE (which provides `casp`, a 16-byte CAS instruction);
- consider using `__sync_bool_compare_and_swap` to force GCC to emit inlined instructions.

For example:
```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Older GCC doesn't define __has_builtin; treat it as "no builtin available".
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

// Enable the 16-byte CAS instruction (Arm LSE CASP / x86 CMPXCHG16B) for just
// the functions below, then restore the previous target options.
#if defined(__aarch64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("lse"))), apply_to = function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__aarch64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("arch=armv8-a+lse")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#elif defined(__x86_64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("cx16"))), apply_to = function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__x86_64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("cx16")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#else
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING
#endif

template<class T>
__attribute__((always_inline)) inline bool cas(
  std::atomic<T>& src, T const& __restrict cmp, T const& __restrict with)
{
  // Copy the operands into __int128 temporaries without risking a memcpy libcall.
  auto inline_copy = [](__int128* dst, const void* __restrict src) {
#if __has_builtin(__builtin_inline_memcpy)
    __builtin_inline_memcpy(dst, src, sizeof(__int128));
#elif __has_builtin(__builtin_memcpy)
    __builtin_memcpy(dst, src, sizeof(__int128));
#else
    ::memcpy(dst, src, sizeof(__int128));
#endif
  };
  __int128 cmp_value;
  __int128 with_value;
  inline_copy(&cmp_value, &cmp);
  inline_copy(&with_value, &with);
  // The __sync builtin is expanded inline when the target feature is available.
  return __sync_bool_compare_and_swap(
    reinterpret_cast<__int128*>(&src), cmp_value, with_value);
}

struct A
{
  int64_t a, b;
};

bool cas_test(std::atomic<__int128>& src, __int128 const& cmp, __int128 const& with)
{
  return cas(src, cmp, with);
}

bool cas_test(std::atomic<A>& src, A const& cmp, A const& with)
{
  return cas(src, cmp, with);
}

PLATFORM_SPECIFIC_OPTIONS_ENDING

// For comparison: the same CAS via std::atomic, which may be lowered to a
// libatomic call instead of a single instruction.
bool cas_test2(std::atomic<__int128>& src, __int128& cmp, __int128& with)
{
  return src.compare_exchange_weak(cmp, with);
}
```
I'm not sure why we'd need anything on Arm. Arm has load-exclusive/store-exclusive, so it doesn't need a second word for ABA protection.
The only platforms where we have a problem have the following combination of things:
- An architecture with a CAS instruction but no LL/SC, so no ABA-proof primitive (i.e. x86; this is not a problem on any other architecture).
- A standard library that uses a compiler builtin for 128-bit CAS.
- A compiler that doesn't expand that builtin to a single instruction.
This means that you need the combination of x86, libstdc++, and gcc to see the problem. GCC has an open bug report about it and so, presumably, will fix it eventually.
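
A quick way to check which case you're hitting (an illustrative sketch, not part of snmalloc) is to compile a tiny 16-byte CAS and inspect the generated code: with an inlined primitive you should see `lock cmpxchg16b`, otherwise a call into libatomic (e.g. `__atomic_compare_exchange_16`).

```cpp
#include <atomic>
#include <cstdint>

// Illustrative probe: build with `g++ -O2 -mcx16 -S` (or look at it on
// Compiler Explorer) and check whether the CAS is inlined or outlined.
struct Pair
{
  uint64_t ptr;
  uint64_t aba; // second word used purely for ABA protection
};

bool probe(std::atomic<Pair>& target, Pair& expected, Pair desired)
{
  return target.compare_exchange_strong(expected, desired);
}
```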
That said, the ABA code isn't used on any hot paths in snmalloc, so the cost of a libcall is unlikely to be noticeable.
This combination of things (a bug in a single compiler that doesn't affect hot paths) means that I'm very hesitant to do something ad-hoc as a workaround for that compiler.
Looking at the code, we seem not to be doing LL/SC for non-x86 architectures. I thought @nwf added that a while ago?
I had a look at LL/SC on Arm, but couldn't work out how to do it. The ABA primitive is designed to be able to use it; I just didn't know a good way to actually achieve it without complex inline assembly across functions.
The use of the ABA protection is only for allocating allocators, and potentially in the backend. So falling back to locks for ABA protection is actually fine in most cases.
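
For reference, a minimal sketch of what the lock fallback could look like (this is not snmalloc's actual code; the names and layout are invented for illustration): guard the pointer/epoch pair with a small spinlock and bump the epoch on every successful update, so a recycled pointer can't be mistaken for the old one.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical lock-based ABA protection: the pointer and an epoch counter
// are updated together under a tiny spinlock instead of a 16-byte CAS.
struct LockedABA
{
  std::atomic_flag lock = ATOMIC_FLAG_INIT;
  void* ptr = nullptr;
  uint64_t epoch = 0;

  // Installs `desired` only if both the pointer and the epoch still match
  // what the caller previously read; returns whether it succeeded.
  bool compare_exchange(void* expected_ptr, uint64_t expected_epoch, void* desired)
  {
    while (lock.test_and_set(std::memory_order_acquire))
    {
      // spin
    }
    bool ok = (ptr == expected_ptr) && (epoch == expected_epoch);
    if (ok)
    {
      ptr = desired;
      epoch++; // a recycled pointer can't be confused with the old one
    }
    lock.clear(std::memory_order_release);
    return ok;
  }
};
```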
Hi,
I opened this issue because I found that LLVM and GCC behave quite differently on whether to inline these atomic ops, which can add an extra link dependency on libatomic.so for Rust.
For example, on x86 we already require the target to be compiled with -mcx16, but GCC will still emit out-of-line atomic calls.
For double-word LL/SC, maybe https://github.com/taiki-e/portable-atomic/blob/main/src/imp/aarch64.rs could be used as a reference?
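
For what it's worth, here is a rough sketch (assumptions: AArch64, GCC/Clang extended inline asm, acquire/release ordering, no LSE dispatch) of a 128-bit CAS built from LDAXP/STLXP, in the spirit of the portable-atomic code linked above. Keeping the whole LL/SC loop in one asm block avoids the problem mentioned earlier of the compiler touching memory between separate statements and clearing the exclusive monitor.

```cpp
#include <cstdint>

#if defined(__aarch64__)
// Hypothetical helper, not snmalloc code: 128-bit CAS via a load/store-exclusive
// pair. Returns true if *addr equalled `expected` and was replaced by `desired`.
inline bool cas128_llsc(unsigned __int128* addr,
                        unsigned __int128 expected,
                        unsigned __int128 desired)
{
  uint64_t exp_lo = static_cast<uint64_t>(expected);
  uint64_t exp_hi = static_cast<uint64_t>(expected >> 64);
  uint64_t des_lo = static_cast<uint64_t>(desired);
  uint64_t des_hi = static_cast<uint64_t>(desired >> 64);
  uint64_t old_lo, old_hi;
  uint32_t status;
  __asm__ volatile(
    "1: ldaxp %[ol], %[oh], [%[p]]        \n" // load-acquire exclusive pair
    "   cmp   %[ol], %[el]                \n"
    "   ccmp  %[oh], %[eh], #0, eq        \n"
    "   b.ne  2f                          \n" // observed value differs: give up
    "   stlxp %w[st], %[dl], %[dh], [%[p]]\n" // store-release exclusive pair
    "   cbnz  %w[st], 1b                  \n" // lost the exclusive: retry
    "2:                                   \n"
    : [ol] "=&r"(old_lo), [oh] "=&r"(old_hi), [st] "=&r"(status)
    : [p] "r"(addr), [el] "r"(exp_lo), [eh] "r"(exp_hi),
      [dl] "r"(des_lo), [dh] "r"(des_hi)
    : "cc", "memory");
  return (old_lo == exp_lo) && (old_hi == exp_hi);
}
#endif
```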