
Dynarec does not emulate x86's strong memory ordering on ARM

Open iamahuman opened this issue 3 years ago • 15 comments

Background

Modern processors implement out-of-order execution to enhance performance. They can reorder instructions as they see fit, as long as the end result of the program stays the same in a single thread.

Memory accesses are no exception to this execution strategy. Processors can reorder memory loads and stores freely as long as the addresses don't alias (overlap) each other. This is called memory ordering.

Normally this isn't a problem on single-core (uniprocessor) CPUs, but the side effects of reordering become visible on multi-core (multiprocessor) CPUs.

To let applications suppress this reordering (e.g. for atomic operations), ISAs provide memory barrier instructions such as MFENCE/LFENCE/SFENCE on x86 or ISB/DSB/DMB on ARM.

Difference between x86 and ARM

To preserve compatibility with older (multiprocessor-unaware) apps and to simplify programming, x86 has strong memory ordering, specifically total store ordering (TSO); it forbids certain kinds of memory access reordering.

Specifically, memory loads cannot be reordered past later memory accesses (acquire semantics: #LoadLoad and #LoadStore), and memory stores cannot be reordered before earlier memory accesses (release semantics: #LoadStore and #StoreStore).

However, ARM makes none of these guarantees; it can reorder memory accesses much more freely. This is called weak memory ordering. Weaker memory ordering gives the processor more opportunity to optimise for performance; however, it requires explicit memory barriers whenever the programmer needs stronger guarantees.
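
To make the hazard concrete, here is the classic message-passing litmus test as a minimal sketch. Run producer and consumer each on its own thread; volatile is used only to keep the compiler from reordering or caching the accesses -- it inserts no hardware barriers, which is exactly the point.

volatile int data = 0;
volatile int ready = 0;

/* Thread 1: publish a value, then raise a flag. */
void producer(void) {
    data = 42;
    ready = 1; /* on ARM, this store may become visible before the store to data */
}

/* Thread 2: wait for the flag, then read the value. */
void consumer(void) {
    while (!ready)
        ;
    /* On x86 (TSO) this always reads 42; on ARM, without barriers,
       it may read 0. */
    int seen = data;
    (void)seen;
}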

The Problem

Many multithreaded applications require that certain parts of the program are not reordered in order to function correctly. Such parts include synchronisation primitives like mutexes and semaphores. These often employ atomic instructions, which rely on the memory ordering model and the memory barriers of the target architecture.

Since x86 already has implicit acquire/release semantics for every memory access (even push/pop!), high-level languages often don't need explicit memory barrier instructions to achieve the ordering guarantees they want. Thus, such memory accesses appear as simple MOVs.

However, ARM needs explicit barriers for these atomic accesses. If they are implemented as simple LDRs and STRs without barriers, the ARM processor is free to reorder them, unlike x86. This results in subtle, hard-to-debug concurrency errors.
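
As an illustration (a sketch of typical compiler output, not a claim about any particular compiler version): a C11 release store usually compiles to a plain mov on x86-64, because TSO already provides release semantics, while on AArch64 it compiles to stlr (or dmb ish plus str). A dynarec that translates that x86 mov into a plain ARM str silently drops the guarantee.

#include <stdatomic.h>

atomic_int flag;

/* x86-64: typically a plain "mov" -- indistinguishable from a
   non-atomic store in the instruction stream.
   AArch64: typically "stlr" (or "dmb ish" + "str"). */
void publish(void) {
    atomic_store_explicit(&flag, 1, memory_order_release);
}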

Solution

Fortunately, this is an already solved problem: see e.g. QEMU and Rosetta 2. Many JIT-based emulators use various strategies to correctly emulate memory ordering of different architectures.

Possible solutions:

  1. Emit barrier instructions before/after every memory access. - Too expensive; this would impose major overhead.
  2. Perform memory accesses to local memory (e.g. local variables, thread-local storage, and other local heap allocations) normally, and trap accesses to memory shared between threads. This can be modeled after cache coherency architectures, such as MESI.

iamahuman avatar Dec 11 '21 03:12 iamahuman

It looks like Rosetta 2 relies on Apple's proprietary modification to implement total store ordering (TSO) on ARM: https://mobile.twitter.com/erratarob/status/1331735383193903104

See also: https://github.com/saagarjha/TSOEnabler

iamahuman avatar Dec 11 '21 03:12 iamahuman

A third possible solution would be to replace mutexes with __transaction_atomic operations (via GCC's libitm extension). Please see https://github.com/ptitSeb/box64/issues/110#issuecomment-983708757

Regards.

bbbruni avatar Dec 11 '21 09:12 bbbruni

A third possible solution would be to replace mutexes with __transaction_atomic operations (via GCC's libitm extension). Please see ptitSeb/box64#110 (comment)

Regards.

Thanks for the help! From https://github.com/ptitSeb/box64/issues/110, which you referenced, I realised that Dynarec has general concurrency issues. I'm afraid those issues are unrelated to this specific bug, which concerns the correctness of the code generation itself.

iamahuman avatar Dec 11 '21 11:12 iamahuman

Yes, different problem here.

To go back to the original ticket,

2. Perform memory accesses to local memory (e.g. local variables, thread-local storage, and other local heap allocations) normally, and trap accesses to memory shared between threads. This can be modeled after cache coherency architectures, such as MESI.

How do you suggest the "shared memory between threads" is identified?

ptitSeb avatar Dec 11 '21 12:12 ptitSeb

  2. Perform memory accesses to local memory (e.g. local variables, thread-local storage, and other local heap allocations) normally, and trap accesses to memory shared between threads. This can be modeled after cache coherency architectures, such as MESI.

How do you suggest the "shared memory between threads" is identified?

  1. Instead of threads, use processes with shared memory, and lazily synchronise page tables between "threads."
  2. Introduce a MESI-style page table subsystem: protect Modified/Exclusive pages as private read-write, Shared pages as shared read-only, and Invalid pages as private no-access.
  3. If a memory access traps on a Shared/Invalid page, do the following (see the trap-handler sketch after this list):
    1. If the page was Invalid, transition it to the Shared state -- migrate the private page from the owning "thread" to shared memory.
    2. Mirror the Shared page at a fixed offset (ShadowOffset) past the page's address. This new mapping has read-write permissions.
    3. Rewrite the JIT code so that it adds ShadowOffset to the memory access, and insert appropriate memory barriers around it.
    4. (Optional) Periodically scan for Shared pages that have not been accessed by multiple threads, and promote them back to Exclusive (private read-write).
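
A minimal sketch of the trapping half of this scheme, assuming pages are protected with mprotect, faults arrive as SIGSEGV, and 4 KiB pages; mark_shared and its bookkeeping are hypothetical names, not existing box86 code:

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* Hypothetical hook: record that the page is now Shared so the JIT
   re-emits the faulting access with barriers (and ShadowOffset). */
static void mark_shared(uintptr_t page) { (void)page; }

static void segv_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    mark_shared(page);
    /* Re-enable access so the faulting instruction can be restarted.
       (Calling mprotect here is common practice in fault-based JITs,
       though not strictly async-signal-safe.) */
    mprotect((void *)page, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

void install_trap_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}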

Alternatively, we can acknowledge this flaw and provide settings to adjust the level of memory barriers emitted -- see https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on-arm-program-compat-troubleshooter#toggling-emulation-settings for how Windows on ARM does this.

iamahuman avatar Dec 11 '21 12:12 iamahuman

Well, yes, for now I have started with the same strategy as Windows on ARM. I have to fine-tune it, as I still don't want to put a barrier on every memory access.

The process you suggested seems very heavy. And there are more issues, because threads share more than memory. Mutexes and things like that would probably need special care. That seems very heavy.

ptitSeb avatar Dec 11 '21 13:12 ptitSeb

Well, yes, for now I have started with the same strategy as Windows on ARM. I have to fine-tune it, as I still don't want to put a barrier on every memory access.

Yes, I agree this is the best way forward.

The process you suggested seems very heavy.

Heavy in terms of implementation cost, yes. In fact, this issue alone could be a research subject in computer science. As for the performance cost, well, we won't know until we've measured it.

And there are more issues, because threads share more than memory.

You're right. Threads also share OS-level resources, such as file descriptor tables. Nevertheless, these can be managed in much the same way the page tables are kept consistent across "threads."

Mutexes and things like that would probably need special care. That seems very heavy.

If anything, mutexes and semaphores are the least of our concerns here. Almost all modern OSes implement inter-process semaphores as part of their IPC mechanisms. Also, Linux and Windows have futexes.

iamahuman avatar Dec 11 '21 16:12 iamahuman

@iamahuman

Introduce a MESI-style page table subsystem: protect Modified/Exclusive pages as private read-write, Shared pages as shared read-only, and Invalid pages as private no-access.

In principle, you can have a look at an approach taken by the British programmer Simon Tatham.

He published a lean idea on how to mimic concurrency on embedded devices (like AVR) with limited resources by means of coroutines.

His code is short but efficient. It resembles a concurrent BlockingQueue from Java. Two functions call each other with an elegant analogue of the goto operator (see the crBegin and crReturn macros). One function produces symbols for the other, which consumes them cooperatively.

Quoting him:

In many modern operating systems, you could do this using pipes between two processes or two threads. emit() in the decompressor writes to a pipe, and getchar() in the parser reads from the other end of the same pipe. Simple and robust, but also heavyweight and not portable. Typically you don't want to have to divide your program into threads for a task this simple.

In this article I offer a creative solution to this sort of structure problem.

https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
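
For reference, the core of the trick from that article (the macros are quoted from the article itself; the example generator is ours): the switch/case machinery resumes the function at the point where it last returned.

/* Stackless coroutine macros, per Tatham's article. */
#define crBegin static int state = 0; switch (state) { case 0:
#define crReturn(x) do { state = __LINE__; return x; case __LINE__:; } while (0)
#define crFinish }

/* Example: yields 0, 1, 2 on successive calls, then -1. */
int generator(void) {
    static int i;
    crBegin;
    for (i = 0; i < 3; i++)
        crReturn(i);
    crFinish;
    return -1;
}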

Replace getchar and add_to_token with a linked list of structs. Replace chars with struct memory_page. And you will get a construction close to what you've described in https://github.com/ptitSeb/box86/issues/519#issuecomment-991627700

bbbruni avatar Dec 11 '21 19:12 bbbruni

This would actually force everything to run on a single core, which would be bad for performance; and the simple fact that it's running on a single core would "fix" any strong-vs-weak memory model issue anyway.

ptitSeb avatar Dec 11 '21 19:12 ptitSeb

This would actually force everything to run on a single core, which would be bad for performance; and the simple fact that it's running on a single core would "fix" any strong-vs-weak memory model issue anyway.

Maybe he has a working approach for launching coroutines through IRQ calls from all cores, who knows... EDIT: I think one can make IRQ calls that launch libitm memory transactions on the linked list. This actually doesn't make a research subject in computer science ;)

bbbruni avatar Dec 11 '21 19:12 bbbruni

This would actually force everything to run on a single core, which would be bad for performance; and the simple fact that it's running on a single core would "fix" any strong-vs-weak memory model issue anyway.

Maybe he has a working approach for launching coroutines through IRQ calls from all cores, who knows... EDIT: I think one can make IRQ calls that launch libitm memory transactions on the linked list. This actually doesn't make a research subject in computer science ;)

I'm not sure exactly what you meant (what is an "IRQ call"? Inter-processor interrupts? System calls? ISR calls? Userspace context switching?). If I understood correctly, it wouldn't be good for performance, either.

  1. The coroutine approach you suggested earlier is equivalent to using only a single worker (core). A worker (core) can multitask (execute multiple threads), but ultimately it executes at most one task at a time; it just switches between threads really fast.
  2. The alternative multi-core approach you suggested has a similar problem: although multiple workers (cores) are in use, only one of them can be running at any moment while the others sit idle, waiting for their turn to execute. This is why some scripting language interpreters, such as Python, cannot achieve multi-core performance even with multiple threads: see the Global Interpreter Lock.

iamahuman avatar Dec 12 '21 03:12 iamahuman

I'm not sure exactly what you meant (what is an "IRQ call"? Inter-processor interrupts? System calls? ISR calls? Userspace context switching?). If I understood correctly, it wouldn't be good for performance, either.

  1. The coroutine approach you suggested earlier is equivalent to using only a single worker (core). A worker (core) can multitask (execute multiple threads), but ultimately it executes at most one task at a time; it just switches between threads really fast.
  2. The alternative multi-core approach you suggested has a similar problem: although multiple workers (cores) are in use, only one of them can be running at any moment while the others sit idle, waiting for their turn to execute. This is why some scripting language interpreters, such as Python, cannot achieve multi-core performance even with multiple threads: see the Global Interpreter Lock.

I mean ISR calls. libitm has no global lock; it may run transactions in both serial and parallel modes.

Quote from Section 2 of C Transactional memory standard:

[Note: Although transactions behave as if they execute in some serial order, an implementation (i.e., compiler, runtime, and hardware) is free to execute transactions concurrently while providing the illusion of serial ordering.]

Quote from Section 4 of the same document:

A transaction statement that uses the __transaction_atomic keyword defines an atomic transaction. We call such a statement an atomic transaction statement:

__transaction_atomic compound-statement

In a data-race-free program, an atomic transaction appears to execute atomically; that is, the compound statement appears to execute as a single indivisible operation whose operations do not interleave with the operations of other threads (Section 4.5). In this setting, atomic transactions allow a programmer to write code fragments that execute in isolation from other threads. The transactions do not observe changes made by other threads during their execution, and other threads do not observe partial results of the transactions.

An atomic transaction executes in an all-or-nothing fashion: it can be explicitly canceled so that its operations have no effect (Section 8).

How about the following scenario.

One core launches a memory transaction to modify a memory page via an ISR call. Another core launches a memory transaction to modify the same page via an ISR call.

The first transaction saves the version of the memory page at the very beginning of the transaction. Then it modifies the page. At the end of the transaction, the page version is checked. If it has been changed, the __transaction_cancel command is called.

The second transaction finishes first and modifies both the page version and the page itself.

The coroutine approach can be modified to make ISR calls with atomic transactions inside, instead of a while (1) loop.
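
For concreteness, this is roughly what the suggestion looks like with GCC's TM extension (a sketch; compile with gcc -fgnu-tm so the transaction is backed by libitm):

static int shared_counter;

/* The compound statement executes as an atomic transaction: other
   threads never observe a partial update, and the runtime may run
   non-conflicting transactions in parallel. */
void bump(void) {
    __transaction_atomic {
        shared_counter++;
    }
}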

bbbruni avatar Dec 12 '21 08:12 bbbruni

I mean ISR calls. libitm has no global lock; it may run transactions in both serial and parallel modes.

Quote from Section 2 of C Transactional memory standard:

[Note: Although transactions behave as if they execute in some serial order, an implementation (i.e., compiler, runtime, and hardware) is free to execute transactions concurrently while providing the illusion of serial ordering.]

Quote from Section 4 of the same document:

A transaction statement that uses the __transaction_atomic keyword defines an atomic transaction. We call such a statement an atomic transaction statement:

__transaction_atomic compound-statement

In a data-race-free program, an atomic transaction appears to execute atomically; that is, the compound statement appears to execute as a single indivisible operation whose operations do not interleave with the operations of other threads (Section 4.5). In this setting, atomic transactions allow a programmer to write code fragments that execute in isolation from other threads. The transactions do not observe changes made by other threads during their execution, and other threads do not observe partial results of the transactions.

An atomic transaction executes in an all-or-nothing fashion: it can be explicitly canceled so that its operations have no effect (Section 8).

I think this is at the wrong level of abstraction. C transactional memory is a high-level concept; we're talking about JIT code here (the translation of x86 to ARM).

At the hardware level, we have the ARM Transactional Memory Extension (TME). Are you implying we should use TME? If so, I don't see how it would be more performant than inserting simple barriers around memory loads/stores.

If I'm wrong about this, can you provide an example of how the following x86-64 instruction could be translated to ARM64?

00401800: mov QWORD PTR [rax], rcx

The instruction above can be generated by either *ptr = val; or __atomic_store_n(ptr, val, __ATOMIC_RELEASE);. However, these two snippets compile to different machine code on ARM64.
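
In source form, the ambiguity looks like this (a sketch: both functions typically compile to the same plain mov on x86-64, but to different instructions, str versus stlr or a barrier plus str, on AArch64):

/* Compiles to a plain store on both architectures. */
void plain_store(long *ptr, long val) {
    *ptr = val;
}

/* Same "mov" on x86-64, but "stlr" (or a barrier + "str") on AArch64. */
void release_store(long *ptr, long val) {
    __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
}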

Here's my take (assume RAX = X0, RCX = X1).

  • Incorrect (unless on Apple Silicon with TSO enabled):
    STR X1, [X0]
    
  • Correct, but slow:
    DMB ISH
    STR X1, [X0]
    
  • Correct, but still slow and requires LDAR on the other side:
    STLR X1, [X0]
    

How about the following scenario.

One core launches a memory transaction to modify a memory page via an ISR call. Another core launches a memory transaction to modify the same page via an ISR call.

The first transaction saves the version of the memory page at the very beginning of the transaction. Then it modifies the page. At the end of the transaction, the page version is checked. If it has been changed, the __transaction_cancel command is called.

The second transaction finishes first and modifies both the page version and the page itself.

The current issue (#519) states that the semantics of every memory access (be it mov, push, pop, ...) in Box86 are broken. That is, if a particular MOV instruction is not required to be atomic by the original x86 compiler, it will work fine, but if it is intended to be atomic, it won't work correctly. Are you saying that we should use STM for every memory load and/or store? Not all loads/stores actually need atomicity.

The coroutine approach can be modified to make ISR calls with atomic transactions inside, instead of a while (1) loop.

Also note that we cannot arbitrarily rewrite the code of the program being emulated.

To be clear, let me state which scenarios are problematic and which ones are not.

  1. The emulated program only uses pthread_* functions for synchronisation AND does not use atomic variables/operations itself - not a problem. We can just implement the pthread_* APIs on ARM. Heck, we can replace all the pthread_mutex_(un)lock calls with STM if you want.
  2. The emulated program implements its own synchronisation mechanisms (e.g. mutexes, semaphores, atomic variables) - this causes the problem. If we had the source code of the program and/or could reverse-engineer it, we could deduce which instructions require atomicity and do whatever we want. However, we're writing an emulator, and that is not a generic approach.

iamahuman avatar Dec 12 '21 09:12 iamahuman

Basically, we can start from the simplest examples.

As per http://nickdesaulniers.github.io/blog/2013/04/03/basic-jit/

#include <stdio.h>    // printf
#include <string.h>   // memcpy
#include <sys/mman.h> // mmap, munmap

int main () {

  // Hexadecimal x86_64 machine code for: int mul (int a, int b) { return a * b; }
  unsigned char code [] = {
    0x55,                   // push rbp
    0x48, 0x89, 0xe5,       // mov rbp, rsp
    0x89, 0x7d, 0xfc,       // mov DWORD PTR [rbp-0x4],edi
    0x89, 0x75, 0xf8,       // mov DWORD PTR [rbp-0x8],esi
    0x8b, 0x75, 0xfc,       // mov esi,DWORD PTR [rbp-0x4]
    0x0f, 0xaf, 0x75, 0xf8, // imul esi,DWORD PTR [rbp-0x8]
    0x89, 0xf0,             // mov eax,esi
    0x5d,                   // pop rbp
    0xc3                    // ret
  };

  // allocate readable, writable, and executable memory via a system call
  void* mem = mmap(NULL, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_ANON | MAP_PRIVATE, -1, 0);

  // copy the runtime code into the allocated memory
  memcpy(mem, code, sizeof(code));

  // typecast the allocated memory to a function pointer
  int (*func) (int, int) = mem;

  // call the function pointer
  printf("%d * %d = %d\n", 5, 11, func(5, 11));

  // free the allocated memory
  munmap(mem, sizeof(code));
}

We can run this program with the JITted function int mul (int a, int b) { return a * b; } under box86 and see if it runs correctly. If it does, we can wrap the call in a pthread and test it. Then we can wrap the pthread call in a semaphore.

Step-by-step, we can determine why box86 fails.

bbbruni avatar Dec 12 '21 10:12 bbbruni

I think diagnosis is unnecessary. @ptitSeb said they are already working on the problem.

iamahuman avatar Dec 12 '21 12:12 iamahuman