
Looking to implement AIO (the old one)

Open fbrosseau opened this issue 2 years ago • 28 comments

Hello,

Thanks for the great project.

I am interested in debugging a program that requires AIO (io_submit & co.) as a baseline, so rr currently returning ENOSYS for it prevents the whole program from starting. (Side note: we are in the process of migrating to io_uring, so I may be interested in contributing that too, but that would come later.)

I have a branch of rr where AIO is implemented and works (at least for my program), but I would like to know if my implementation is done in the way that makes most sense for rr.

Also, my implementation currently only supports going through the syscalls, but technically, like io_uring, AIO can also be used in a way where syscalls are not required for gathering completions, by polling the ring directly from usermode. My guess is that if mainline rr were to support AIO (and no longer return ENOSYS), it would need to support that scenario too, so as not to break things that may work today only because AIO is disabled.


Just for a quick refresher on AIO:

  1. You create an AIO context with io_setup. The kernel returns an aio_context_t, which is in fact a usermode pointer to an aio_ring structure. This is a slight difference from io_uring, where you get back a file descriptor that you mmap yourself from usermode.
  2. You submit IO with io_submit. Unlike io_uring, there is no way around this: you must always make a syscall to submit.
  3. Get the events:
     3a. Typically, you call io_getevents to park and wait on your queue for completions.
     3b. You may also have the kernel post an event (e.g. an eventfd that you epoll), and upon completion you either use io_getevents or peek into the ring yourself.
     3c. You may also never call into the kernel at all and just poll the ring yourself forever.

So, quite similar to io_uring in multiple ways, but also with notable differences.
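
For concreteness, here is a minimal sketch of that flow using the raw syscalls (no libaio), assuming only linux/aio_abi.h and syscall(2); error handling is omitted and the file path is just a placeholder:

```cpp
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>

int main() {
  aio_context_t ctx = 0;
  syscall(SYS_io_setup, 128, &ctx);          // 1. create the context; ctx is really a
                                             //    pointer to the kernel's aio_ring mapping

  int fd = open("/etc/hostname", O_RDONLY);  // placeholder file
  char buf[4096];
  struct iocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_fildes = fd;
  cb.aio_lio_opcode = IOCB_CMD_PREAD;
  cb.aio_buf = (__u64)(uintptr_t)buf;
  cb.aio_nbytes = sizeof(buf);
  cb.aio_offset = 0;
  struct iocb* cbs[1] = { &cb };
  syscall(SYS_io_submit, ctx, 1, cbs);       // 2. submission is always a syscall

  struct io_event ev;
  syscall(SYS_io_getevents, ctx, 1, 1, &ev, nullptr);  // 3a. park until the read completes

  syscall(SYS_io_destroy, ctx);
  return 0;
}
```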


So - what I have currently looks like this:

  1. On io_setup successful exit, allocate some context that will be responsible for tracking in-flight IO. Let's call this the RemoteAioContext. This object contains a list of iocbs.
  2. On io_submit entry, copy the user's iocb structure into the corresponding RemoteAioContext.
  3. On successful io_getevents exit, inspect the io_event list and find the corresponding nodes in the RemoteAioContext.
     3a. If the original opcode was IOCB_CMD_PREAD or IOCB_CMD_PREADV, call RecordTask::record_remote for the memory, etc.
  4. (Also implement io_destroy/io_cancel to clean up state, etc.)
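
For illustration, a stripped-down sketch of that bookkeeping could look something like the following. Everything here apart from the aio_abi.h types is hypothetical (not actual rr classes), and a real implementation would also have to walk the iovec array for IOCB_CMD_PREADV:

```cpp
#include <linux/aio_abi.h>
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of the in-flight tracking described in steps 1-3 above.
struct RemoteAioContext {
  // Keyed by the tracee's iocb pointer as passed to io_submit; the kernel
  // echoes this pointer back in io_event::obj, which is how a completion is
  // matched to its submission.
  std::unordered_map<uint64_t, iocb> inflight;

  // Step 2: on io_submit entry, remember a copy of each submitted iocb.
  void on_submit(uint64_t iocb_ptr, const iocb& copy) {
    inflight[iocb_ptr] = copy;
  }

  // Step 3: on io_getevents exit, decide whether the completed request filled
  // tracee memory that the recorder needs to save (e.g. via record_remote).
  // Only the IOCB_CMD_PREAD case is shown; IOCB_CMD_PREADV would need the
  // iovec indirection handled as well.
  bool on_completion(const io_event& ev, uint64_t* buf, uint64_t* nbytes) {
    auto it = inflight.find(ev.obj);
    if (it == inflight.end())
      return false;
    bool needs_record = it->second.aio_lio_opcode == IOCB_CMD_PREAD;
    if (needs_record) {
      *buf = it->second.aio_buf;
      *nbytes = it->second.aio_nbytes;
    }
    inflight.erase(it);
    return needs_record;
  }
};
```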

This approach works for my use case, because my program always uses io_getevents, but that may not be the case for other users of AIO. I have read your comments about the io_uring implementation in #2613, and I would like to confirm what you think would be the best way to apply them to AIO.

Based on your suggestions for io_uring, what I think could work when applied to AIO would be to have io_setup allocate its own copy of the aio_ring and return that to the debuggee instead of the real one (this implies that all subsequent syscalls need to tweak the aio_context_t to restore the real value the kernel expects). Then, what I am unsure about is how rr would go about migrating completions from the real ring into the copy (while doing the correct record_remote & co. where necessary according to the opcode). If we assume the "worst" case where usermode never syscalls for completions, when would that migration across rings happen?
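
For reference, the ring that the aio_context_t points at looks roughly like this (mirroring the definition in the kernel's fs/aio.c; treat the layout as a sketch rather than a stable ABI declaration), and the "usermode never syscalls" case boils down to consuming it directly:

```cpp
#include <linux/aio_abi.h>

// Approximate layout of the ring behind an aio_context_t (per fs/aio.c).
struct aio_ring {
  unsigned id;                 // kernel-internal index
  unsigned nr;                 // number of io_event slots
  unsigned head;               // consumer index, advanced by usermode pollers
  unsigned tail;               // producer index, advanced by the kernel
  unsigned magic;              // AIO_RING_MAGIC
  unsigned compat_features;
  unsigned incompat_features;
  unsigned header_length;      // sizeof(struct aio_ring)
  struct io_event io_events[]; // completions land here
};

// The pure-polling case (3c in the refresher): the kernel writes completions
// straight into this shared mapping, so there is no syscall for rr to hook.
// Real code needs the appropriate memory barriers, omitted here.
static inline int poll_one(aio_context_t ctx, io_event* out) {
  aio_ring* ring = reinterpret_cast<aio_ring*>(ctx);
  unsigned head = ring->head;
  if (head == ring->tail)
    return 0;                             // nothing completed yet
  *out = ring->io_events[head];
  ring->head = (head + 1) % ring->nr;     // publish the new consumer index
  return 1;
}
```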

Otherwise, would it be possible to keep things simple by requiring an opt-in option for AIO support and not implementing the polling part?

fbrosseau • Jan 11 '22 14:01

> So - what I have currently looks like this:

That sounds reasonable.

> My guess is that if mainline rr were to support AIO (and no longer return ENOSYS), it would need to support that scenario too, so as not to break things that may work today only because AIO is disabled.

That sounds like significant work (just like it would be for io_uring).

> Then, what I am unsure about is how rr would go about migrating completions from the real ring into the copy (while doing the correct record_remote & co. where necessary according to the opcode). If we assume the "worst" case where usermode never syscalls for completions, when would that migration across rings happen?

Seems to me well-behaved usermode code would empty the fake ring and then call some syscall, right? Well-behaved code shouldn't just busy-wait on the ring.

If we did want to support that, rr does wake up tracees periodically using PTRACE_INTERRUPT ... unless we're in unlimited_ticks_mode, so we'd need to disable unlimited_ticks_mode while AIO is in use. When rr wakes up it could check the ring and if we need to service it, emit a SCHED event to attach ring data copy records to.
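
For readers unfamiliar with that mechanism, the periodic wake-up amounts to something like the following generic ptrace loop (not rr code; it assumes the tracee was attached with PTRACE_SEIZE, and the actual ring inspection and SCHED-event bookkeeping are elided):

```cpp
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Periodically interrupt a seized tracee so its memory can be inspected.
void periodic_ring_check(pid_t tracee) {
  for (;;) {
    sleep(1);                                // stand-in for rr's wake-up timer
    ptrace(PTRACE_INTERRUPT, tracee, 0, 0);  // force a ptrace-stop
    int status;
    waitpid(tracee, &status, 0);             // wait for the stop to land
    // ... read the tracee's ring here (process_vm_readv / PTRACE_PEEKDATA)
    //     and record any completions that appeared since the last check ...
    ptrace(PTRACE_CONT, tracee, 0, 0);       // resume the tracee
  }
}
```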

I think the real question is, how much of this is actually worth doing. Is there a reason why you want to do the work to support AIO when io_uring is around the corner?

rocallahan • Jan 14 '22 10:01

Thanks for the answers.

> Seems to me well-behaved usermode code would empty the fake ring and then call some syscall, right? Well-behaved code shouldn't just busy-wait on the ring.

Yes, I guess that is true. In any program there has to be some logic somewhere that ends up doing syscalls, and otherwise the periodic tick would do the trick.

Just an idea - I am not 100% familiar with everything rr is able to do, but would it be feasible to detect that the debuggee is poking into the ring from usermode by having it fault on first access of the fake ring or something?

> I think the real question is, how much of this is actually worth doing. Is there a reason why you want to do the work to support AIO when io_uring is around the corner?

I agree. The problem is that this is for a big deployed product: even if, in our specific case, we do ship a new version leveraging io_uring, there will be customer scenarios with non-bleeding-edge kernels for years.

I mean, obviously it would be unfortunate to have a big blob of almost-dead AIO code in rr; that is why I wanted to start this discussion and get guidance on the best way to go. The way I see it, if AIO is done properly, it should be able to share a lot of code with the eventual io_uring support, which I assume is the one that is truly desirable.

fbrosseau • Jan 14 '22 13:01

> but would it be feasible to detect that the debuggee is poking into the ring from usermode by having it fault on first access of the fake ring or something?

Theoretically we could use data watchpoints, but I would only do that as a very last resort; it would create new requirements and failure modes for recording. Also, there are only a limited number of them (usually 4), each covering only a limited range of memory (8 bytes), so it doesn't generalize to more than 4 uses.

> The problem is that this is for a big deployed product: even if, in our specific case, we do ship a new version leveraging io_uring, there will be customer scenarios with non-bleeding-edge kernels for years.

That's a good reason. BTW are you able to tell us what this product is? :-) Just curious.

Adding the minimum support required for your product is probably reasonable.

rocallahan • Jan 14 '22 18:01

> Theoretically we could use data watchpoints

Oh, yeah, if watchpoints are the only way then that's too complicated / impractical. I was thinking of mmapping the ring as PROT_NONE and catching the fault; isn't that something the rr debugger could intercept? But in any case, if you feel this could go in with just the syscall path supported, then even better and simpler.
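
For illustration, the plain-userspace version of that trick looks roughly like this (a minimal sketch independent of rr; the hard-coded page size and the bare-bones signal handling are simplifications):

```cpp
#include <csignal>
#include <cstddef>
#include <sys/mman.h>

// Map a page PROT_NONE so the first touch raises SIGSEGV; the handler notes
// the access and unprotects the page so the faulting instruction can retry.
static char* g_page;
static const size_t kPageSize = 4096;  // assume 4 KiB pages for the sketch

static void on_fault(int, siginfo_t* si, void*) {
  char* addr = static_cast<char*>(si->si_addr);
  if (addr >= g_page && addr < g_page + kPageSize) {
    // First access detected; make the page usable and let the access retry.
    mprotect(g_page, kPageSize, PROT_READ | PROT_WRITE);
  }
}

int main() {
  g_page = static_cast<char*>(mmap(nullptr, kPageSize, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  struct sigaction sa = {};
  sa.sa_sigaction = on_fault;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, nullptr);

  volatile char* p = g_page;
  char c = p[0];  // faults once, the handler unprotects, then the read succeeds
  (void)c;
  return 0;
}
```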

> BTW are you able to tell us what this product is? :-) Just curious.

So I am only toying around with a prototype here, nothing formal yet, but this would be for SQL Server for Linux. In its current state it looks like we may be too syscall/multithread-heavy to hope for "negligible" recording overhead; in my basic tests the overhead is extreme, but we are always looking for ways to reduce these bottlenecks on our side anyway. In any case, the way I see it, rr could be an extremely valuable tool for boot failures, but hopefully we can work on this problem and make it viable for runtime too.

Just for context, our debugging experience is extremely custom and built on top of LLDB, which is why I am looking to improve LLDB support, as in my PR #3067.

Interestingly, if this ends up working, rr will be used to debug "Windows" code (or rather, PE binaries), through this LLDB layer.

fbrosseau • Jan 24 '22 22:01

> In its current state it looks like we may be too syscall/multithread-heavy to hope for "negligible" recording overhead; in my basic tests the overhead is extreme

Multithreading is indeed an issue, but syscall-heaviness is usually OK as long as all the syscalls are buffered. You may want to look at the trace dump to see if there are any frequent syscalls that aren't buffered.

> Interestingly, if this ends up working, rr will be used to debug "Windows" code (or rather, PE binaries), through this LLDB layer.

Old news :). We've been debugging our Windows binaries like this through Wine for years.

Keno • Jan 24 '22 22:01

Haha, great. I hadn't realized the "Windows" side could easily be debugged through Wine; I thought it required custom tooling rather than standard Windows tools, meaning no source-level debugging on that side? For SQL Server we are quite similar to Wine in theory, although our approach sits at a different abstraction level.

fbrosseau • Jan 24 '22 23:01

Source-level debugging, including mixed Windows/Linux code, works fine but requires a GDB patch (https://github.com/JuliaComputing/gdb-solib-wine), and GDB isn't really designed for it. I agree it would be easier in LLDB, so maybe some of your tooling can help there as well.

Keno • Jan 24 '22 23:01