Redesign syscallbuf to always unwind on interruption
This is a major redesign of the syscallbuf code with the goal of establishing the invariant that we never switch away from a tracee while it's in the syscallbuf code. Instead, we unwind the syscallbuf code completely and execute the syscall at a special syscall instruction now placed in the extended jump patch.
The primary motivation for this that this fixes #3285, but I think the change is overall very beneficial. We have significant complexity in the recorder to deal with the possibility of interrupting the tracee during the syscallbuf. This commit does not yet remove this complexity (the change is already very big), but that should be easy to do as a follow up. Additionally, we used to be unable to perform syscall buffering for syscalls performed inside a signal handler that interrupted a syscall. This had performance implications on use cases like stack walkers, which often perform multiple memory-probing system calls for every frame to deal with the possibility of invalid unwind info.
There are many details here, but here's a high level overview. The layout of the new extended jump patch is:
<stack setup>
call <syscallbuf_hook>
// Bail path returns here
<stack restore>
syscall
<code from the original patch site>
// Non-bail path returns here.
jmp return_addr
One detail worth mentioning is what happens if a signal gets delivered once
the tracee is out of the syscallbuf, but still in the extended jump patch
(i.e. after the stack restore). In this case, rr will rewrite the ip of the
signal frame to point to the equivalent ip in the original, now patched code
section. Of course the instructions in question are no longer there, but
the CFI will nevertheless be generally accurate for the current register state
(excluding weird CFI that explicitly references the ip of course).
This allows unwinders in the end-user-application to never have to unwind
through any frame in the rr syscallbuf, which seems like a desirable
property. Of course, sigreturn must perform the opposite transformation
to avoid actually returning into a patched-out location.
The main drawback of this scheme is that while the application will never see a location without CFI, GDB does still lack unwind information in the extended jump stub. This is not a new problem, but syscall events are now in the extended jump stub, so they come up quite frequently.
I don't think this is a huge problem - it's basically the same situation we used to have before the vdso changes. I believe the best way to fix this would be to establish some way of having rr inform gdb of its jump patches (in fact gdb already has this kind of mechanism for tracepoints, it's just not exposed for tracepoints initiated by the gdb server), but I don't intend to do this myself anytime in the near future. That said, I should note that doing this would not require any changes on the record side, so could be done anytime and start working retroactively for already recorded traces.
Note that this is just the x86_64/i686 version of this. Aarch64 merged while I was working on this, so that's not included yet, but should be reasonably straightforward. There's also a little bit of cleanup still to do, but this should work 100% otherwise, so if there's any concerns about the details, we can discuss them now.
Alright, I think this is working now everywhere. @rocallahan @yuyichao, please take a look. As I said, there's a fair bit of additional code deletion that could be done as a result of this, but I think we can do that in a follow up PR.
I think probably we should make a new rr release before we land this ... I expect this will destabilize things for a while.
I think probably we should make a new rr release before we land this ... I expect this will destabilize things for a while.
That's probably a good idea.
Rebased. Shall we go ahead with the release on master then? FWIW, I've used this a bit so far and have so far not found any new issues that came down to this change, but still an intermediate release is probably prudent.
Let's move forward with this! Can you rebase it? I promise I'll review it :-)
I'm still interested in this change. I ran into "Expected syscall_bp_vm to be clear" today while debugging a python process.