Two slight differences when compiling code as C and C++
Hi @daanx,
First thanks again for this awesome library!
My team (COD devs @ Activision) has spotted some issues with hangs akin to https://github.com/microsoft/mimalloc/issues/890, but on Windows. Very rare and really hard to repro, but given our fleet of test infra it happens quite a lot.
We found that it may be due to improper use of atomics, and that compiling as C++ possibly fixes it: comparing the generated assembly (.cod files from MSVC) shows the C++ version adds memory barriers and the nop (npad 1) instruction, and possibly other changes.
So this was hopeful, but I wanted to check what else changes when moving to C++, and I've found that MSVCP140.dll now gets loaded too (which is not unexpected in general, but it wasn't loaded before). Here are the two functions it needed.
(In case you are wondering about the naming: we have a special custom build of mimalloc with flavors switchable at runtime via an environment variable; essentially each "mimalloc-flavor-xxxx.dll" is mimalloc compiled with special flags.)
So from MSVCP140.dll it needs _Thrd_yield and std::get_new_handler. The latter is expected, I guess, but I got curious about the former.
Why would changing to C++ need this now and not before? That led me to this discovery:
https://github.com/microsoft/mimalloc/blob/09a27098aa6e9286518bd9c74e6ffa7199c3f04e/include/mimalloc/atomic.h#L361
```c
#if defined(__cplusplus)
#include <thread>
static inline void mi_atomic_yield(void) {
  std::this_thread::yield();
}
#elif defined(_WIN32)
static inline void mi_atomic_yield(void) {
  YieldProcessor();
}
#elif defined(__SSE2__)
#include <emmintrin.h>
...
```
and then to the std::this_thread::yield implementation -
https://github.com/microsoft/STL/blob/5f8b52546480a01d1d9be6c033e31dfce48d4f13/stl/src/cthread.cpp#L86
which seems to call SwitchToThread:

```cpp
_CRTIMP2_PURE void __cdecl _Thrd_yield() noexcept { // surrender remainder of timeslice
    SwitchToThread();
}
```
So the difference is: YieldProcessor emits a pause instruction (same as what _mm_pause would do), while std::this_thread::yield calls SwitchToThread(), which may sometimes call into the kernel to reschedule the current thread. I've asked ChatGPT about it here: https://chatgpt.com/share/689beeff-69c0-800a-be43-c16de7a405e3
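To make the contrast concrete, here is a minimal sketch (assuming Windows on x86/x64; the wrapper names are mine, not mimalloc's) of the two yield flavors:

```c
#include <windows.h>    // YieldProcessor, SwitchToThread

// Stays entirely in user mode: emits a single `pause` instruction,
// which only hints to the CPU that we are in a spin-wait loop.
static inline void yield_cpu(void) {
  YieldProcessor();     // expands to _mm_pause() on x86/x64
}

// May enter the kernel: offers the remainder of the timeslice to
// another ready thread (this is what _Thrd_yield boils down to).
static inline void yield_os(void) {
  SwitchToThread();
}
```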
I didn't want to change the behavior too much, so for these two cases I decided to stick with the way "C" did it (and I also did not want to load MSVCP140.dll; it gets loaded anyway, but I didn't want to disturb this part).
But it raises the question: what is really intended to be called here, the "pause" (YieldProcessor), or std::this_thread::yield / SwitchToThread?
Thank you!
A further example of mi_atomic differences between C and C++ can be seen in this small example, where the C atomics end up being reordered incorrectly for MSVC:
https://godbolt.org/z/fbKrj79h6
Note the MSVC/C variant produces a warning. It looks like the correct type for now_serving should be uintptr_t.
Have you tried with that?
Using uintptr_t removes the warning; however, the semantics of the atomics are still incorrect:
https://godbolt.org/z/9j6r8exEW
Specifically, mi_atomic_store_explicit and mi_atomic_loadi64_explicit when compiling with _MSC_VER and _M_X64 / _M_IX86, because _Atomic does nothing when compiling for MSVC (and support for it also seems experimental).
```c
#if defined(__cplusplus)
...
#elif defined(_MSC_VER)
// Use MSVC C wrapper for C11 atomics
#define _Atomic(tp)  tp

static inline void mi_atomic_store_explicit(_Atomic(uintptr_t)*p, uintptr_t x, mi_memory_order mo) {
  (void)(mo);
#if defined(_M_IX86) || defined(_M_X64)
  *p = x;
#else
  ...
#endif
}

static inline int64_t mi_atomic_loadi64_explicit(_Atomic(int64_t)*p, mi_memory_order mo) {
  (void)(mo);
#if defined(_M_X64)
  return *p;
#else
  ...
#endif
}
```
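With `_Atomic(tp)` defined away like that, the "atomic" store above is just a plain assignment, so the optimizer may treat it as one. A hypothetical reduction (identifiers are mine, not from mimalloc) of what that allows:

```c
#include <stdint.h>

uintptr_t data;   // payload written before publishing
uintptr_t ready;  // flag that another thread spins on

void publish(uintptr_t v) {
  data  = v;  // intended: ordinary store
  ready = 1;  // intended: release store that must stay after `data = v`;
              // as a plain store, nothing stops the compiler from
              // reordering it under the C abstract machine
}
```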
Internet wisdom says: "Normal loads up to 64-bits are atomic on x86, provided they're within a single cache-line (on any 64-bit capable CPU). In practice that means you're safe provided the data is aligned as the compiler normally would (alignment == size)." Presumably all of these prerequisites are assumed to be met inside mimalloc, so it seems "don't do anything special" should be fine?
(There are probably more "proper" sources available than Reddit, but I don't want to bother researching right now.)
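If one wanted to make that precondition explicit, a small C11 sketch (mine, not from mimalloc) could assert it at compile time:

```c
#include <stdint.h>

// The "no tearing" argument only holds when alignment == size, which
// the compiler guarantees for naturally aligned scalar types.
_Static_assert(_Alignof(uintptr_t) == sizeof(uintptr_t),
               "uintptr_t must be naturally aligned for plain loads/stores");
```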
Addendum: but it is weird that the while loop is missing in that particular case.
My guess is that could be the result of optimization, maybe because the loop doesn't really have a side effect?
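That guess fits what a plain load permits. A hypothetical reduction of the godbolt example (identifiers are mine):

```c
#include <stdint.h>

uintptr_t now_serving;  // written by another thread in the real code

void wait_for(uintptr_t ticket) {
  // With a plain (non-atomic, non-volatile) load and an empty body, the
  // compiler may assume now_serving never changes here, hoist the load
  // out of the loop, and reduce the spin to a single test.
  while (now_serving != ticket) {
    /* spin */
  }
}
```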
C (and C++) define an abstract machine which contains things like the execution and memory model; this is where the ordering of operations is defined. When compiling to something like x86 there is no requirement to use the x86 memory model; instead, under the as-if rule, the generated code only has to behave as if it had run on the C abstract machine.
This reordering and elimination of code is done based on this abstract machine, so even when compiling for x86 the compiler, while optimizing, will not use x86 total store order (TSO) but instead the C memory model, which is much weaker.
We are well past the days of C or C++ being a glorified assembler; optimizers do a huge amount of work reordering, eliminating dead code, etc. While the CPU won't reorder things, the compiler most certainly will.
You can read more here https://en.cppreference.com/w/c/language/memory_model.html
Luckily, we can see what std::atomic<> does on MSVC.
Seems to me it's this: https://github.com/microsoft/STL/blob/5f8b52546480a01d1d9be6c033e31dfce48d4f13/stl/inc/atomic#L922
__iso_volatile_loadNN are documented here: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#IsoVolatileLoadStore -
"These intrinsic functions explicitly perform loads and stores that aren't subject to compiler optimizations."
(Despite being listed as an ARM64 intrinsic, it seems to work fine on x64.)
Indeed, replacing mi_atomic_load_acquire with __iso_volatile_load64 in the godbolt link above re-adds the "missing loop" in the MSVC/C case.
So perhaps the fix for machine code differences would be to use __iso_volatile_loadXX as well, or perhaps just throw in a cast to a volatile pointer (as suggested by https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#remarks).
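For illustration, a sketch of what that could look like (assuming MSVC; the wrapper name is mine):

```c
#include <stdint.h>
#include <intrin.h>  // __iso_volatile_load64

// A load the optimizer must actually perform (no hoisting or elimination).
static inline int64_t my_loadi64(const volatile int64_t* p) {
#if defined(_M_X64) || defined(_M_ARM64)
  return __iso_volatile_load64(p);
#else
  return *p;  // fallback: rely on the volatile-qualified access itself
#endif
}
```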
As @reductor has already explained, the standard describes an abstract machine, and the memory model introduced in C11/C++11 defines a set of rules around the happens-before relation, which in turn defines the possible optimizations and reorderings that may be performed by the compiler and/or the hardware.
Load/store operations on properly aligned pointer-sized values may be atomic on virtually all architectures, but that is not sufficient. Historically, volatile has been used for variables that may be accessed concurrently from multiple threads, because it forces the compiler to reread the variable, but that is also not enough: volatile was never designed or intended for multi-threaded use. It prevents reordering of volatile operations with respect to each other, but not with respect to non-volatile operations, and the latter is exactly what we need for concurrent code. So just using __iso_volatile_loadXX instead is also not enough; the important part in the code linked above is the _ATOMIC_POST_LOAD_BARRIER_AS_NEEDED.
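A sketch (mine, assuming MSVC on x64, where loads already have acquire semantics at the hardware level) of what that post-load barrier adds on top of the bare load:

```c
#include <stdint.h>
#include <intrin.h>  // __iso_volatile_load64, _ReadWriteBarrier

static inline int64_t my_loadi64_acquire(const volatile int64_t* p) {
  int64_t v = __iso_volatile_load64(p);  // load that cannot be optimized away
  _ReadWriteBarrier();  // compiler-only fence: keeps later memory accesses
                        // from being moved before this load (acquire)
  return v;
}
```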