SaBRe Mimalloc download

This PR contains the following changes that are required to be bundled together in order for tests to pass:

Statically links mimalloc to both SaBRe and all plugins by default.
- Static vs Dynamic: I chose to statically link mimalloc in order to avoid where the libmimalloc.so library should be placed. Having one binary that you can move anywhere you want is much more flexible.
- Why mimalloc: mimalloc uses mmap to handle its heap. This creates a natural separation between SaBRe+plugins and the client. mimalloc is also relatively small and uses cmake that makes it "easy" to integrate its build with SaBRe.
  - I also tried to integrate jemalloc but jemalloc by default uses brk() so we end up having the same issue as libc.
  - I didn't try tcmalloc.
- Why do we patch mimalloc: mimalloc by default will try and mmap pages at an empty space at a very low address space (look here and you can verify for your own interest by running: LD_PRELOAD=./libmimalloc.so.1.6 cat /proc/self/maps | less). These maps are created with mmap when SaBRe starts (as SaBRe now uses mimalloc). Now when we load a client that uses tsan, tsan scans the memory to check what is there already. We had issues with this scanning process before, this is why we have this SaBRe code here. tsan doesn't recognize these mimalloc-heap memory areas as "standard" and throws an error when scans the process memory maps. You can find the tsan allowed ranges here, whatever is outside of these ranges, will make tsan fail. In the beginning I though of patching the SaBRe code to hide the mimalloc-heap area but I think this will create more issues rather than solve anything. So the easiest way is to disable the mimalloc hint, and delegate the mmap area to the kernel. This is absolutely fine from mimalloc's perspective.
- What is MI_LOCAL_DYNAMIC_TLS=ON: Read here.
- Static linking vs compiling with Object: mimalloc suggests an alternative building way here. Honestly I tried to make this work but cmake made my life extremely hard and at some point I just quit. If anyone has an opinion regarding .a vs .o static linking, I would also like to know more.
Fixes (in a slow way) TLS swapping.
- As we discussed, the plugin lives in the link maps of SaBRe. Thus switching the TLS is mandatory every time we jump from the client to the plugin. This PR safeguards all access points to the plugin (i.e. syscalls and vDSO) and properly switches the TLS.
- The following solution introduces overhead as we introduce 2 extra syscalls for every client syscall. The ideal solution would be if the plugin were to live in the same link maps as the client.
  - The exact amount of overhead is to be measured.
Fixes an infinite recursion case that vDSO and syscalls are vulnerable to.
- mimalloc uses vDSO to sync heap access. SaBRe patches the whole vDSO shared object. This creates an infinite recursion situation when mimalloc (or any vDSO call in general) from inside the plugin makes a vDSO call which again ends up in the plugin handler which creates an infinite recursion.

May 04 '20 15:05 andronat

Thanks, @andronat. I haven't looked at the details, but I think the TLS switching should be separate from adding support for mimalloc. In other words, SaBRe should work the same with mimalloc as it does now for our plugins, even w/o switching the TLS. Furthermore, because it's expensive, I would make that configurable.

May 05 '20 11:05 ccadar

Going even further down the rabbit hole, when the %fs register switches we also need to update the thread_local variables in SaBRe.

But why do we need these vars to be thread_local?: Previously as you can see here SaBRe used 2 static vars to keep track of the position of the TLS. Unfortunately this won't work for any program with more than 1 thread. It is mandatory to have thread_local vars to keep track of each TLS as the TLS is per thread.

May 05 '20 12:05 andronat

Thanks, @andronat. I haven't looked at the details, but I think the TLS switching should be separate from adding support for mimalloc. In other words, SaBRe should work the same with mimalloc as it does now for our plugins, even w/o switching the TLS. Furthermore, because it's expensive, I would make that configurable.

@ccadar I'm afraid the TLS mechanism is fundamental to the correctness of SaBRe and thus we can't add mimalloc without fixing the TLS first. In addition, I don't think we can make this mechanism optional as:

mimalloc will always crash or recurse infinitely if we don't fix the TLS first (mimalloc uses pthreads and vDSO).
Any use of synchronization (e.g. pthreads or even printf which has some synchronization) will result in a crash or even worse a silent memory corruption.
Any memory allocation (e.g. malloc) will result in a crash or even worse a silent memory corruption.

It is a mere coincidence that until now we didn't see a crash in SaBRe's plugins. For example, in my VM all the tests are passing, while in Travis things are failing (this is why I asked for ssh access to travis-ci.com because I couldn't reproduce the issues in my machine), I was lucky enough for the silent memory corruption to not disturb the output of the tests.

Something that can be done is merge the TLS fixes first (when properly done, because I see travis failing again) and then the memory allocator. But for now mimalloc is a good test that surfaces important issues.

May 05 '20 13:05 andronat

@andronat I agree that to be able to support a more diverse set of plugins, we need to swap the TLS. But at the same time, I think it's important to allow plugins like Varan -- which perform their own synchronization, etc. -- to be as fast as they are now.

So I think the right solution is to add that fix, but make it configurable at compile time (or maybe even at runtime if the extra check doesn't add much overhead, not sure). And yes, I would make that change in a different PR.

May 05 '20 13:05 ccadar

To throw another wrench in the mix, TLS stores its variables (i.e. thread_local vars like those here) in fixed offsets at compile-time.

As an example, you can verify this by objdump -D sabre and by one of our tests objdump -D tests/Output/test_-static_-pie.c.tmp1.

From test tests/Output/test_-static_-pie.c.tmp1:

000000000045a820 <__ctype_init>:
...
  45a852:       48 c7 c1 e8 ff ff ff    mov    $0xffffffffffffffe8,%rcx
  45a859:       48 8d 97 00 02 00 00    lea    0x200(%rdi),%rdx
  45a860:       64 48 89 11             mov    %rdx,%fs:(%rcx)
  45a864:       48 c7 c2 e0 ff ff ff    mov    $0xffffffffffffffe0,%rdx
  45a86b:       64 48 89 02             mov    %rax,%fs:(%rdx)
  45a86f:       c3                      retq

From SaBRe:

000000000000cf38 <load_sabre_tls>:
    cf38:       55                      push   %rbp
    cf39:       48 89 e5                mov    %rsp,%rbp
    cf3c:       64 48 8b 04 25 e8 ff    mov    %fs:0xffffffffffffffe8,%rax
    cf43:       ff ff
    cf45:       48 85 c0                test   %rax,%rax
    cf48:       74 68                   je     cfb2 <load_sabre_tls+0x7a>
    cf4a:       64 48 8b 04 25 e8 ff    mov    %fs:0xffffffffffffffe8,%rax
...

It should be obvious that those two will always clash if the %fs register is the same.

But is the %fs register the same between SaBRe and the client?

The answer is yes most of the time, but no during the TLS swapping...

In functions load_client_tls and load_sabre_tls we keep track of the TLS position for "sabre-space" and "client-space" in 2 thread-local vars (code here). Now when we switch the TLS to go from the client to the plugin, we need to know where the TLS is. The only way to do this safely is if the variables are thread-local. But at the same time these vars cannot be thread-local as the will clash with thread_local vars from the client (due to the fixed offset nature of the TLS).

Further explanation can be found here.

Bottomline: Switching TLS is much much harder than expected to do it properly.

What can we do?: I'm not sure to be honest. This is a chicken-and-egg problem. I could potentially mmap some random page in memory and keep the TLS pointers there manually.

May 06 '20 19:05 andronat

Before the TLS switch you can take a pointer to the address of the thread local variable and access it through there as mentioned in GCC’s documentation: “When the address-of operator is applied to a thread-local variable, it is evaluated at run-time and returns the address of the current thread's instance of that variable. An address so obtained may be used by any thread. When a thread terminates, any pointers to thread-local variables in that thread become invalid.” Alternatively, you can make a global data structured indexed by thread id to keep that info around without relying on thread local variables

May 06 '20 20:05 daniel-grumberg

Before the TLS switch you can take a pointer to the address of the thread local variable and access it through there as mentioned in GCC’s documentation:

Hmm, I don't see how this could work. I would still need to store the address to some thread-local variable for later use.

Alternatively, you can make a global data structured indexed by thread id to keep that info around without relying on thread local variables

Yea, this is exactly what I wanted to avoid to implement 😞

May 06 '20 20:05 andronat

You can store the pointers as local variables in proxy_plugin_sc_handler. Just get the addresses before the switch there and explicitly pass them in to the TLS switching routines.

May 06 '20 20:05 daniel-grumberg

An update on my side:

@daniel-grumberg sorry for the delayed response, I didn't expect to take me that long... For what it's worth, I implemented a global hash table to work as custom TLS. See this commit here: ad58e236faf44456fb68f8ea997915bf70200b31.
I think I managed to properly initialize a custom TLS, but I'm not sure on how robust this method is. I'm actually using a hack to call some internal functions from the ld. For more see here.
Tests are passing and mimalloc integration is working.
I'll use this PR as a reference point for working with adding SaBRe plugins as dependencies to the client.

Issues:

As we discussed this is a slow solution as it adds two extra arch_prctl syscalls. This working solutions adds even more syscalls (multiple gettid()) required for the custom TLS to operate.

May 19 '20 02:05 andronat

Hi @andronat thanks! I would still remove the commit that switches to Mimalloc to a follow-up PR. I think it's orthogonal to the TLS issue, and it would be good to see that the regression tests pass before the switch (they should, right?).

May 21 '20 12:05 ccadar

Just another update on my side, I was able to run zboxfs as a new plugin with the current PR. To make it run though I had to solve another tricky issue, see commit here: 34df09d199979df4e1401c4057506db4f3820c83.

Whenever a new thread is created a new stack is also allocated for the new thread. In the beginning of this newly created stack, libc allocates the new TLS + some extra important information. The full struct that represents the TLS and everything else required by libc can be found here. These information are essential for even the most basic operations, like the TID number put by there the kernel. In commit 34df09d199979df4e1401c4057506db4f3820c83 I actually measured the size of struct pthread, subtracted by the newly allocated TLS, and I copy everything from the client's thread stack.

Unfortunately I'm not sure how can I use some more robust way to measure the size of these structs as libc doesn't expose them in a header. I used the gdb to dynamically call sizeof() on them and recored the sized.

This strategy is very risky as I don't know if any of these "after-TLS" information need to be carefully derived from the parent SaBRe process. I think at this point we are reaching the limits of this architecture and we should be moving forward with re-architecting the plugins.

May 26 '20 18:05 andronat

@andronat Thanks for your great endeavour trying to make this TLS scheme work. Unfortunately you're now running into the kind of intricacies that I absolutely wanted to avoid. All of this is completely undocumented and private to a specific libc implementation. It may even change between two minor revisions of glibc. Therefore, unless you have a clear idea of how to make this scheme robust, I'm not sure it is worth your time trying to go further into this direction. This is pointed out in this SO question: https://stackoverflow.com/a/10060871/357851

May 27 '20 09:05 parras

One clean solution that I can contemplate would be to call pthread_createon SaBRe every time the clone syscall is intercepted. This newly-created thread would just wait until a syscall is intercepted.

May 27 '20 10:05 parras

@parras

Thanks for your great endeavour trying to make this TLS scheme work. Unfortunately you're now running into the kind of intricacies that I absolutely wanted to avoid.

Yea, my point exactly. We are at the limit of this approach. Re-architecting how the plugins are working might be an one-way.

One clean solution that I can contemplate would be to call pthread_create on SaBRe every time the clone syscall is intercepted. This newly-created thread would just wait until a syscall is intercepted.

I've thought about about this but only briefly. If I understand correctly, what you mention is that we have one idle thread on the side, created by SaBRe, that we can use as a template to copy the required stack and TLS information to the newly created client thread, right?

In this case, I'm still concerned about the after-TLS details. e.g. the SaBRe thread will have a different TID stored in the after-TLS from the one the client thread will eventually have. So we will still have to copy and replace various items. I'm afraid we might miss something.

Another idea I had that I though only briefly, was to actually double call pthread_create and intercept the clone call twice while instrumenting the recursion. So when the client calls pthread_create I can call pthread_create again from the plugin and get everything in the right place. (or so I thought). There are at least 2 problems with this approach: 1) there are information added by the kernel on the after-TLS, the simplest example is the TID, and 2) this is heavy on memory as we spend RAM for 1 extra thread for every thread.

May 27 '20 17:05 andronat

Another idea I had that I though only briefly, was to actually double call pthread_create and intercept the clone call twice while instrumenting the recursion. So when the client calls pthread_create I can call pthread_create again from the plugin and get everything in the right place. (or so I thought). There are at least 2 problems with this approach: 1) there are information added by the kernel on the after-TLS, the simplest example is the TID, and 2) this is heavy on memory as we spend RAM for 1 extra thread for every thread.

This is more or less what I have in mind.

I don't get this. Since you call pthread_create, libc will handle everything properly on SaBRe side. Of course, you will get a different TID since it's a different thread. Is that a problem?
Definitely, that's the only real downside I can think of for now. This needs benchmarking.

May 27 '20 21:05 parras

I don't get this. Since you call pthread_create, libc will handle everything properly on SaBRe side. Of course, you will get a different TID since it's a different thread. Is that a problem?

My concern is there will be 1 clone and the kernel will write this info in only 1 stack. I'm afraid that I don't always know what the kernel writes after the clone syscall and coping these info in the correct place might be again problematic. What if there are also derived values based on info on the stack that might be different between the sabre and client threads? I completely agree though that this is much better than 34df09d.

In any case, I would like to explore a little bit adding the plugin as a dependency to the client before I go deeper on the current approach. I'll be opening a PR soon to compare.

May 28 '20 17:05 andronat

My concern is there will be 1 clone and the kernel will write this info in only 1 stack. I'm afraid that I don't always know what the kernel writes after the clone syscall and coping these info in the correct place might be again problematic. What if there are also derived values based on info on the stack that might be different between the sabre and client threads?

RIght. Then, we could go one step further in this direction and completely run the plugin inside the pthread_created thread instead of just using it as a template to copy data from. This would obviously add context switch overhead but that should be acceptable for "heavy" plugins -- I beleive zbox fits into this category.

May 28 '20 17:05 parras

SaBRe SaBRe copied to clipboard

Mimalloc download

SaBRe
SaBRe copied to clipboard