Any way to run just chaos mode without any recording/syscall intercepts?
We're observing a very strange failure & trying to rule things out. Basically we have process A and process B: process B is forked from process C, which is itself forked from process A, and B writes to the write endpoint of a pipe pair it has inherited (the pipe is opened in A). Process A asynchronously waits on edge-triggered epoll for the read endpoint to become readable (the FD is registered with epoll prior to the initial fork in A). We only reliably observe the failure under rr's chaos mode.
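To make that topology concrete, here's a minimal, self-contained sketch (illustrative only, not our real code): A creates the pipe and registers the read end with an edge-triggered epoll before forking, the child forks again, and the grandchild writes to the inherited write end.

// Minimal sketch of the A -> C -> B topology described above (illustrative only).
#include <sys/epoll.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fds[2];
  if (pipe(fds) < 0) { perror("pipe"); return 1; }
  fcntl(fds[0], F_SETFL, O_NONBLOCK);        // read end is consumed non-blocking in A

  // Process A: register the read end with edge-triggered epoll *before* forking.
  int epfd = epoll_create1(0);
  struct epoll_event ev = {};
  ev.events = EPOLLIN | EPOLLET;
  ev.data.fd = fds[0];
  epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);

  pid_t c = fork();                          // A forks C
  if (c == 0) {
    pid_t b = fork();                        // C forks B
    if (b == 0) {
      const char msg[] = "hello";
      write(fds[1], msg, sizeof(msg));       // B writes to the inherited write end
      _exit(0);
    }
    waitpid(b, nullptr, 0);
    _exit(0);
  }

  // A waits (edge-triggered) for the read end to become readable.
  struct epoll_event out;
  int n = epoll_wait(epfd, &out, 1, 2000 /* ms */);
  printf("epoll_wait returned %d\n", n);
  waitpid(c, nullptr, 0);
  return 0;
}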
Obviously the question is whether rr record is somehow altering the behavior of syscalls in a broken way. Would it be possible to add a mode to rr (or is there one already?) that just runs the instrumented command in chaos mode & doesn't do anything else, to rule out the recording itself somehow causing problems? We do observe a "similar" error even outside of rr, but much less frequently, and it's not necessarily the same failure mode as what's happening with rr. My gut is that rr is catching a real problem here; I'm just trying to rule out all possibilities.
would it be possible to add a mode to rr (or is there one already?) that just runs the instrumented command in chaos mode & doesn't do anything else to rule out the recording itself somehow causing problems?
That doesn't exist. You could write a ptrace supervisor that implements rr chaos mode's scheduling and mmap randomization without the rest of rr, but it would have very limited use cases.
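For illustration only, here's a very rough sketch of the bare skeleton of such a supervisor: it follows one process through syscall stops with ptrace and occasionally injects artificial delays. It is not rr code, doesn't follow forks, and does none of chaos mode's real scheduling or mmap randomization.

// Crude single-process scheduling perturbation via ptrace (illustrative sketch, not rr).
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  if (argc < 2) { fprintf(stderr, "usage: %s command [args...]\n", argv[0]); return 1; }
  pid_t child = fork();
  if (child == 0) {
    ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
    execvp(argv[1], &argv[1]);
    perror("execvp");
    _exit(127);
  }
  int status;
  waitpid(child, &status, 0);                          // initial stop after execvp
  ptrace(PTRACE_SETOPTIONS, child, nullptr, (void*)PTRACE_O_TRACESYSGOOD);
  int sig = 0;
  for (;;) {
    if (ptrace(PTRACE_SYSCALL, child, nullptr, (void*)(long)sig) < 0) break;
    if (waitpid(child, &status, 0) < 0 || WIFEXITED(status) || WIFSIGNALED(status)) break;
    // With TRACESYSGOOD, syscall stops report SIGTRAP|0x80; forward any other signal.
    sig = (WIFSTOPPED(status) && WSTOPSIG(status) != (SIGTRAP | 0x80)) ? WSTOPSIG(status) : 0;
    if (rand() % 8 == 0) usleep((rand() % 50) * 1000); // occasional artificial delay
  }
  return 0;
}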
epoll_wait does have some slightly strange behaviour under ptrace (e.g. see https://github.com/iovisor/bpftrace/issues/712 ) so I can't rule out that your bug is caused by rr recording, though if that's your issue then a standalone chaos-mode ptrace supervisor would have the same problem.
It is also common for rr chaos mode to reveal real bugs that are normally latent. You may consider those not worth fixing, but an rr recording should make them not hard to fix, and if you use Pernosco it can be easier still. Also, a bug that appears to be latent can show up very infrequently, or on a different platform, or in the future, or only on certain user machines, and those are all nightmares that will make you wish you'd fixed it! (FWIW part of the original motivation for chaos mode was to get a Firefox bug that only showed up on Android in Firefox CI to reproduce on desktop Linux.)
Yeah, we handle EINTR for epoll_wait AFAICT (https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/kj/async-unix.c%2B%2B#L599). I concur that it's concerning, and my personal prior is that it is a latent bug. However, I've been unsuccessful at figuring out from the rr recording what's going on: one process does appear to issue the write, but the read end never seems to see the epoll trigger. How would Pernosco aid in diagnosing this cross-process IPC issue more effectively than naive gdb debugging?
What I almost want is to be able to query the kernel during playback for what the fd has in its queue / how much is buffered on the read side.
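During the live (recorded) run you can get part of the way there with FIONREAD on the read end; a small sketch (this queries the real kernel, so it only works in the recorded process, not during replay):

// Report how many unread bytes are buffered behind a pipe's read fd (sketch).
#include <sys/ioctl.h>
#include <cstdio>

int bufferedBytes(int readFd) {
  int n = 0;
  if (ioctl(readFd, FIONREAD, &n) < 0) {   // FIONREAD: bytes currently readable
    perror("ioctl(FIONREAD)");
    return -1;
  }
  return n;
}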
Is there a spot in the rr codebase I could look at to see how you do the scheduling & mmap randomization?
https://github.com/rr-debugger/rr/blob/815eae2f4de45c990c92dcf12da818dcf816b7ea/src/Scheduler.cc#L362 https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
How would Pernosco aid in diagnosing this cross-process IPC issue more effectively than naive gdb debugging?
If this bug is some weird kernel issue and you already know the exact syscall sequence, maybe it wouldn't. In general it makes everything better though :-).
I would examine the rr dump to confirm your assumptions about the syscall sequence; it will show you the syscalls and the schedule. You might want to check that the data written by B is not read out of the pipe unexpectedly before A's epoll; the rr dump might help you locate candidate syscalls to rule out. It would also be interesting to see whether A did the epoll EINTR round trip.
If you can reproduce the bug with rr record -n that will give you more information about each syscall because our syscall buffering optimization will be turned off. The recording will be a lot slower though and might make it impossible to reproduce the bug.
When the test hangs under rr, you should be able to use lsof or something like that to inspect the buffered amount.
These are all fantastic tips. Thanks so much! I'll definitely try this out shortly.
When I turn off syscall buffering (rr -n) the issue appears to disappear. I haven't had any luck getting the buffer usage via lsof; do I need to add anything other than -s? I did issue a FIONREAD ioctl on the read fd after epoll_wait times out, and it reports there are 88 bytes in the buffer. Does this indicate some kind of subtle bug in syscall buffering?
Are you using edge-triggered epoll or level-triggered?
edge-triggered epoll
That could explain why you have data in the buffer and yet epoll did not unblock. Are you sure your design here is correct? Keep in mind that written data can be added to the pipe any time between the start and end of a write() syscall, and that the actual work of epoll() can happen anytime after the start of the syscall.
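For reference, the usual edge-triggered contract is that once the event fires you must drain the fd until read() reports EAGAIN before going back to epoll_wait(); a generic sketch of that pattern (not the KJ implementation):

// Canonical edge-triggered consumption: drain until EAGAIN before waiting again (sketch).
#include <unistd.h>
#include <cerrno>

bool drainFd(int fd) {
  char buf[4096];
  for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) continue;                   // real code would hand these bytes to a parser
    if (n == 0) return false;              // peer closed the write end
    if (errno == EINTR) continue;          // interrupted; just retry
    if (errno == EAGAIN || errno == EWOULDBLOCK) return true;  // fully drained; safe to re-wait
    return false;                          // genuine error
  }
}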
Are you sure your design here is correct?
The design & implementation seem to handle the corner cases I've read about. As best I can tell the issue isn't in KJ itself (which is the event loop underlying our proprietary codebase). It's always possible it's higher up but I don't really see the higher up code do anything truly weird.
The original process (which creates the pipe(2)s) registers the read fd (non-blocking) with its own epoll_wait before it forks the child (the read endpoint is set to close-on-exec). The write FD is passed through to the forked child which then forks again & writes to it (the FD is in blocking mode).
When I debug the problem scenario in rr replay, I never see the read FD get signaled in epoll_wait so it's a more subtle problem than the typical flaw of "got the edge-triggered epoll but didn't fully consume the fd".
Maybe I need to see if I can write a trivial program to reproduce this using rr.
Maybe I need to see if I can write a trivial program to reproduce this using rr.
That would be great if you can.
One thing you could do is build rr yourself, but edit syscallbuf.c to remove the epoll_wait case and see if that fixes your problem.
I added a trivial rr test for edge-triggered epoll here: https://github.com/rr-debugger/rr/commit/fd5178713bfa8d25af7b9858309a542a0631aed0 It passes with kernel 5.14.
I wonder about the syscallbuf code here. It calls epoll_wait() with a zero timeout. If that call returns no events, or EINTR, then we retry the epoll_wait() with its original arguments. Maybe an EINTR return followed by a retry doesn't work for edge-triggered epoll_wait()s?
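For illustration, the pattern being described looks roughly like this (a sketch, not rr's actual syscallbuf code); the question is whether an edge that fires between the probe and the retry is reliably seen by the second call:

// Zero-timeout probe followed by a retry with the original timeout (illustrative sketch).
#include <sys/epoll.h>
#include <cerrno>

int epollWaitWithProbe(int epfd, struct epoll_event* events, int maxEvents, int timeoutMs) {
  int n = epoll_wait(epfd, events, maxEvents, 0);   // non-blocking probe
  if (n > 0) return n;                              // something was already ready
  if (n < 0 && errno != EINTR) return n;            // genuine error: don't retry
  // Probe saw nothing (or was interrupted): redo the call with the original timeout.
  return epoll_wait(epfd, events, maxEvents, timeoutMs);
}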
Maybe an EINTR return followed by a retry doesn't work for edge-triggered epoll_wait()s?
Seems to me that if the epoll_wait() returns EINTR then there should have been no edge-trigger pending, and if the edge triggers after that was checked, then it should cause the next "traced" epoll_wait() to unblock. But maybe the kernel has issues here...
The trace doesn't show EINTR on the signalled fd, FWIW. I've also found that a pure fork doesn't reproduce it; an exec needs to be in the middle there.
The trace is a bit big so I put it on Google Drive (can switch to Dropbox if that's more convenient): https://drive.google.com/file/d/1skzdhNHOALN-dEuHSFB1p8RFqIVNRzLI/view?usp=sharing
The link is restricted as I'm not sure whether the rr trace may contain PII (just send me a request for access and I'll approve).
The machine I ran this on was a laptop with an Intel i7-10750H CPU.
The test code I used:
#include <kj/test.h>
#include <kj/async.h>
#include <kj/async-io.h>
#include <kj/io.h>
#include <capnp/compiler/lexer.capnp.h>
#include <capnp/message.h>
#include <capnp/serialize.h>
#include <capnp/serialize-async.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <unistd.h>
int main(int argc, char** argv) {
  // Create this thread's event loop.
  auto binExe = argv[0];
  if (argc == 1) {
    auto io = kj::setupAsyncIo();
    // Initialize the file descriptors.
    kj::Own<kj::AsyncInputStream> peerStream;
    kj::AutoCloseFd peerOutFd;
    {
      int rawFds[2];
      KJ_SYSCALL(pipe(rawFds));
      KJ_SYSCALL(ioctl(rawFds[0], FIOCLEX));
      // Make sure the read endpoint isn't inherited across exec (FIOCLEX sets close-on-exec).
      peerOutFd = kj::AutoCloseFd(rawFds[1]);
      peerStream = io.lowLevelProvider->wrapInputFd(kj::AutoCloseFd(rawFds[0]));
    }
    // Set up children to write to us.
    kj::Vector<const char*> args;
    auto peerOutFdAsStr = kj::str(peerOutFd.get());
    args.add(binExe);
    args.add("child1");
    args.add(peerOutFdAsStr.cStr());
    args.add(nullptr);
    int childPid;
    KJ_SYSCALL(childPid = fork());
    if (childPid == 0) {
      // const_cast OK because execv() doesn't really modify the args, it just has a bad signature
      // for historical reasons.
      KJ_DBG("Launching", binExe);
      execvp(binExe, const_cast<char**>(args.begin()));
      perror("exec");
      _exit(1);
    }
    usleep(100 * kj::MILLISECONDS / kj::MICROSECONDS);
    // Read the message from the child.
    auto& timer = io.provider->getTimer();
    KJ_DBG("Waiting for message", getpid());
    timer.timeoutAfter(2 * kj::SECONDS, capnp::readMessage(*peerStream.get())).wait(io.waitScope);
    KJ_DBG("Got message");
    int status;
    KJ_SYSCALL(waitpid(childPid, &status, 0));
    return WEXITSTATUS(status);
  } else if (strcmp(argv[1], "child1") == 0) {
    // First child: fork again and let the grandchild do the actual write.
    KJ_DBG("Started child process", getpid());
    kj::AutoCloseFd peerOutFd(kj::StringPtr(argv[2]).parseAs<int>());
    int childPid;
    KJ_SYSCALL(childPid = fork());
    if (childPid == 0) {
      KJ_DBG("Sending message", getpid(), peerOutFd.get());
      // Fill a random cap'n'proto message.
      capnp::MallocMessageBuilder message;
      auto root = message.initRoot<capnp::compiler::Token>();
      auto parenList = root.initParenthesizedList(2);
      auto firstTokens = parenList.init(0, 2);
      firstTokens[0].setIdentifier("some identifier");
      firstTokens[1].setStringLiteral("some string literal");
      auto secondTokens = parenList.init(1, 2);
      secondTokens[0].setIntegerLiteral(235235);
      secondTokens[0].setStartByte(1);
      root.setIdentifier("my identifier");
      root.setStringLiteral("some string literal");
      {
        kj::AutoCloseFd fd(peerOutFd.get());
        capnp::writeMessageToFd(fd, message);
      }
      usleep(10 * kj::MILLISECONDS / kj::MICROSECONDS);
      _exit(0);
    }
    int status;
    KJ_SYSCALL(waitpid(childPid, &status, 0));
    _exit(WEXITSTATUS(status));
  } else {
    KJ_DBG("Unknown arguments!");
    _exit(1);
  }
}
It took quite a number of runs of this in a loop, and it doesn't repro as easily as it does in my more complete test code, so I'm not sure what else is needed. The compiled binary (not runnable but useful for symbols):
test-rr-epoll-wait.gz
The trace is a bit big so I put it on Google Drive (can switch to Dropbox if that's more convenient): https://drive.google.com/file/d/1skzdhNHOALN-dEuHSFB1p8RFqIVNRzLI/view?usp=sharing
I sent an access request.
What version of Capnp are you building against? 0.8 doesn't have lexer.capnp.h
Granted. I built against capnp master. lexer.capnp.h was just a random capnp file I had lying around in my source folder; you might be able to just grab it from master? The primary motivation was to trigger the writev path to the fd in the capnp source. I've attached the capnp header & source, although I imagine the CAPNP_VERSION won't match (maybe hacking the version number will just work? I don't think 0.9 has really had any changes).
lexer.tar.gz
I just built capnp master and it did not install lexer.capnp.h :-(
OK, I hacked things enough to build and run it locally.
I can't reproduce this on 5.14 with 200 runs of the testcase under rr chaos mode.
So according to your trace the last epoll_wait returns 0 fds ready after about 2 seconds of blocking:
{
real_time:551471.654556 global_time:2103, event:`SYSCALL: epoll_wait' (state:ENTERING_SYSCALL) tid:2868900, ticks:5882250
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x10 rsi:0x7ffe5b802d90 rdi:0x3 rbp:0x681ffef0 rsp:0x681ffdd0 r8:0x7fffffff r9:0x563557525010 r10:0x769 r11:0x246 r12:0x563555877d10 r13:0x0 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x6a17598799c0 gs_base:0x0
}
{
real_time:551471.654561 global_time:2104, event:`SYSCALLBUF_RESET' tid:2868900, ticks:5882250
}
{
real_time:551473.554222 global_time:2105, event:`SYSCALL: epoll_wait' (state:EXITING_SYSCALL) tid:2868900, ticks:5882250
rax:0x0 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x10 rsi:0x7ffe5b802d90 rdi:0x3 rbp:0x681ffef0 rsp:0x681ffdd0 r8:0x7fffffff r9:0x563557525010 r10:0x769 r11:0x246 r12:0x563555877d10 r13:0x0 r14:0x0 r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x6a17598799c0 gs_base:0x0
{ tid:2868900, addr:0x7ffe5b802d90, length:0xc0 }
}
{
real_time:551473.554360 global_time:2106, event:`SYSCALLBUF_FLUSH' tid:2868900, ticks:5885746
{ syscall:'clock_gettime', ret:0x0, size:0x20 }
}
but after that it looks like the child process is still starting up after exec:
{
real_time:551473.643158 global_time:2137, event:`SYSCALL: mmap' (state:EXITING_SYSCALL) tid:2868901, ticks:42302
rax:0x71a04da42000 rbx:0x802 rcx:0xffffffffffffffff rdx:0x1 rsi:0x20190 rdi:0x0 rbp:0x0 rsp:0x7ffcd1b3e058 r8:0x3 r9:0x0 r10:0x802 r11:0x246 r12:0x7ffcd1b3e070 r13:0x0 r14:0x1 r15:0x486b7bf0f500 rip:0x7f6824270b86 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9 fs_base:0x0 gs_base:0x0
{ map_file:"/usr/lib/libpthread-2.33.so", addr:0x71a04da42000, length:0x21000, prot_flags:"r--p", file_offset:0x0, device:65026, inode:14340, data_file:"/home/roc/tmp/y/test-rr-epoll-wait-0/mmap_pack_1_libpthread-2.33.so", data_offset:0x0, file_size:0x259b8 }
{ tid:2868901, addr:0x71a04da57480, length:0x1 }
{ tid:2868901, addr:0x71a04da5d048, length:0x4 }
}
so I think this recording failed the test because the child process ran extremely slowly (chaos mode doing its thing?), not because of any issue with epoll.
Hmm.... Damn it. Let me check rr dump on my side on the more complete binary.
Not working on this project anymore, so I'm closing out a bug that I doubt will see any forward momentum.