Service syscalls from the shim
We believe much of the current overhead of the dev branch is due to switching between the plugin and shadow-worker processes to service syscalls. On each such switch, either one process is scheduled off the physical CPU core so the other can run (which is expensive), or both processes run concurrently on two physical CPU cores, consuming an extra core per plugin thread.
We could instead link libshadow into the shim, and arrange for Shadow's data structures to live in shared memory (by using a custom allocator in Shadow, which glibc supports well), mapped at a fixed address in every plugin process during the LD_PRELOADed shim's initialization. The shim could then service syscalls itself without having to return control to a separate Shadow thread.
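As a rough sketch of the mapping step, the shim's constructor could map the shared region at an agreed fixed address before the plugin's `main()` runs. This assumes the Shadow process creates the region (e.g. with `shm_open`) ahead of time; the region name, address, and size below are made up for illustration:

```c
/* Hypothetical shim initializer: map the Shadow shared-memory region at a
 * fixed virtual address before the plugin's main() runs. SHADOW_SHM_PATH,
 * SHADOW_SHM_ADDR, and SHADOW_SHM_SIZE are illustrative names/values. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHADOW_SHM_PATH "/shadow-sim-state"      /* created by the Shadow process */
#define SHADOW_SHM_ADDR ((void*)0x7f0000000000)  /* same address in every plugin */
#define SHADOW_SHM_SIZE (1UL << 30)              /* 1 GiB region, for example */

static void* shadow_state = NULL;

__attribute__((constructor))
static void shim_init(void) {
    int fd = shm_open(SHADOW_SHM_PATH, O_RDWR, 0);
    if (fd < 0) {
        perror("shm_open");
        abort();
    }
    /* MAP_FIXED_NOREPLACE (Linux >= 4.17) fails instead of clobbering an
     * existing mapping, so a collision with the plugin's own mappings is
     * detected loudly rather than silently corrupting memory. */
    shadow_state = mmap(SHADOW_SHM_ADDR, SHADOW_SHM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED_NOREPLACE, fd, 0);
    close(fd);
    if (shadow_state != SHADOW_SHM_ADDR) {
        perror("mmap");
        abort();
    }
    /* Pointers inside the region now resolve to the same addresses they have
     * in the Shadow process, so libshadow code linked into the shim can use
     * Shadow's data structures directly. */
}
```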
Additionally, when a syscall does block the plugin thread, the shim could first unblock the next scheduled thread by posting to its semaphore in shared memory, and then block on its own semaphore. That way there'd only be a context switch to the next plugin thread, rather than switching back to Shadow just to schedule the next plugin thread.
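A minimal sketch of that handoff, assuming each plugin thread has a process-shared semaphore in the shared region and the shim can see which thread the scheduler picks next (the struct and function names are hypothetical):

```c
#include <semaphore.h>

/* Hypothetical per-thread state kept in the shared-memory region. */
struct shim_thread {
    sem_t run_sem;   /* init'd with sem_init(&run_sem, 1, 0): process-shared, count 0 */
    /* ... other scheduling state ... */
};

/* Called by the shim when the current thread must block on a syscall.
 * `self` is the calling thread's state; `next` is whichever thread the
 * scheduler (also in shared memory) says should run next. */
static void shim_block_and_handoff(struct shim_thread* self, struct shim_thread* next) {
    /* Wake the next runnable plugin thread directly... */
    sem_post(&next->run_sem);
    /* ...then sleep until some other thread (or Shadow) hands control back. */
    while (sem_wait(&self->run_sem) != 0) {
        /* retry if interrupted by a signal */
    }
}
```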
BTW this is similar to User Mode Linux (UML)'s "traditional" or "tracing thread" strategy. In that strategy they have a single tracing thread for all UML processes, but the tracing thread just transfers control to the UML code in the UML process on signals and syscalls; we could do the same here if we still use ptrace to catch syscalls that aren't already interposed by the shim.
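For reference, the ptrace fallback for non-interposed syscalls could be a plain tracer loop like the bare-bones sketch below. It only observes syscall numbers on x86-64 and leaves servicing/emulation out, and it is not how Shadow's existing threadptrace code is structured:

```c
/* Minimal sketch: trace a child and report every syscall that reaches the
 * kernel (i.e. anything the shim did not interpose). x86-64 only. */
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char** argv) {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);   /* e.g. the plugin binary */
        perror("execvp");
        _exit(1);
    }
    int status;
    waitpid(child, &status, 0);      /* initial stop at exec */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void*)PTRACE_O_TRACESYSGOOD);

    int entering = 1;
    while (1) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to next syscall stop */
        if (waitpid(child, &status, 0) < 0 || WIFEXITED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            if (entering) {
                struct user_regs_struct regs;
                ptrace(PTRACE_GETREGS, child, NULL, &regs);
                fprintf(stderr, "syscall %lld reached the tracer\n",
                        (long long)regs.orig_rax);
                /* A real implementation would service or emulate it here. */
            }
            entering = !entering;    /* stops alternate between entry and exit */
        }
    }
    return 0;
}
```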
The UML docs generally cite the more recent "separate kernel address space" (skas) strategies as preferred. These are closer to Shadow's current threadptrace model, where UML/Shadow isn't loaded into the traced process's address space, and the UML/Shadow process services the syscalls. However, I don't think the advantages apply in our case. http://user-mode-linux.sourceforge.net/old/skas.html
- It saves virtual address space in the traced process. When UML was being developed they were working with a 32-bit address space. I think this is a non-issue on x86-64.
- It prevents the traced process from directly accessing the UML (or in our case Shadow) state loaded into the process. We don't care about this as much in our security model, where we trust the traced processes. While it does mean a wild write in a traced process could corrupt Shadow's state instead of just that process's state, the end result is a spoiled simulation in either case.
- UML gets a performance boost by servicing syscalls in the tracing thread instead of having to return control to the traced process. While this extra control transfer would indeed be slow for the "ptrace path", for paths that we've optimized to go through the "preload path" there would be no such control transfers back and forth (which again is the motivating factor for making this change :) )
Since this issue was created, we've learned more about the performance cost of switching between Shadow and the managed processes. Switching cost is much less significant than we originally thought it would be when:
- using 1 shadow worker thread for each managed process
- pinning each shadow worker thread and its managed process to the same core (see the sketch after this list)
- using shared memory with semaphores to signal the switch
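The pinning in the second bullet can be done with `sched_setaffinity`; a rough sketch (the helper name and core-assignment policy are made up for illustration):

```c
/* Hypothetical pinning helper: the Shadow worker pins itself and its managed
 * process to the same core, so a switch between them stays on one CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin the calling worker thread and the given managed process to `core`. */
static int pin_pair(pid_t managed_pid, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* Pin this worker thread (pid 0 == the calling thread). */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity(worker)");
        return -1;
    }
    /* Pin the managed process to the same core. */
    if (sched_setaffinity(managed_pid, sizeof(set), &set) != 0) {
        perror("sched_setaffinity(managed)");
        return -1;
    }
    return 0;
}
```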
Moving more code into the managed process's address space could still improve performance even with the above optimizations, but there are still many unknowns and a lot of potential complexity in moving "most" of Shadow there.