rr
rr copied to clipboard
CI: Intermittent failure in cont_race
E.g.
- https://buildkite.com/julialang/rr/builds/191#eab915ff-919f-4167-98c2-dc592a739804
- https://buildkite.com/julialang/rr/builds/197#72075042-cf0c-4ea5-bdaf-627e7d37566a
- https://buildkite.com/julialang/rr/builds/194#505a91bf-d491-41b1-98df-c26f46887da2
There appear to be two bugs here:
- timeout during replay
- crash during recording with
[FATAL /cache/build/dockerized-amdci4-8/julialang/rr/src/AutoRemoteSyscalls.cc:567:infallible_send_fd()]
-> Assertion `child_fd >= 0' failed to hold. Failed to send fd; err=ESRCH
This is testing async SIGKILLs during recording so it's not that surprising that we might trigger multiple bugs.
For the crash, the rr stack is
#5 0x000056331c79a517 in rr::EmergencyDebugOstream::~EmergencyDebugOstream (this=0x7ffc8580e9e0, __in_chrg=<optimized out>) at /cache/build/dockerized-amdci4-8/julialang/rr/src/log.cc:451
#6 0x000056331c6f7dbe in rr::AutoRemoteSyscalls::infallible_send_fd (this=0x7ffc8580ead0, our_fd=...) at /cache/build/dockerized-amdci4-8/julialang/rr/src/AutoRemoteSyscalls.cc:567
#7 0x000056331c864523 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x56331d340a10) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordTask.cc:473
#8 0x000056331c8590d7 in rr::RecordTask::init_buffers (this=0x56331d340a10) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordTask.cc:510
#9 0x000056331c811343 in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x56331d340a10, syscall_state=...) at /cache/build/dockerized-amdci4-8/julialang/rr/src/record_syscall.cc:6639
#10 0x000056331c7fb306 in rr::rec_process_syscall_internal (t=0x56331d340a10, arch=rr::x86, syscall_state=...) at /cache/build/dockerized-amdci4-8/julialang/rr/src/record_syscall.cc:6673
#11 0x000056331c7fb453 in rr::rec_process_syscall (t=0x56331d340a10) at /cache/build/dockerized-amdci4-8/julialang/rr/src/record_syscall.cc:6693
#12 0x000056331c7dad18 in rr::RecordSession::syscall_state_changed (this=0x56331d330000, t=0x56331d340a10, step_state=0x7ffc8580ff6c) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordSession.cc:1226
#13 0x000056331c7e062a in rr::RecordSession::record_step (this=0x56331d330000) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordSession.cc:2428
This stack occurs in both 194 and 197.
Actually no, 197 is a different but similar-looking stack:
#5 0x000056245d7d3517 in rr::EmergencyDebugOstream::~EmergencyDebugOstream (this=0x7ffd32ffc6f0, __in_chrg=<optimized out>) at /cache/build/dockerized-amdci4-9/julialang/rr/src/log.cc:451
#6 0x000056245d730dbe in rr::AutoRemoteSyscalls::infallible_send_fd (this=0x7ffd32ffda70, our_fd=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/AutoRemoteSyscalls.cc:567
#7 0x000056245d902f6c in rr::Session::create_shared_mmap (remote=..., size=1048576, map_hint=..., name=0x7ffd32ffd930 "syscallbuf.47894", tracee_prot=3, tracee_flags=0, monitored=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/Session.cc:443
#8 0x000056245d92e6aa in rr::Task::init_syscall_buffer (this=0x56245ddefe10, remote=..., map_hint=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/Task.cc:2664
#9 0x000056245d89d244 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x56245ddefe10) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordTask.cc:456
#10 0x000056245d8920d7 in rr::RecordTask::init_buffers (this=0x56245ddefe10) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordTask.cc:510
#11 0x000056245d84a343 in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x56245ddefe10, syscall_state=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/record_syscall.cc:6639
#12 0x000056245d834306 in rr::rec_process_syscall_internal (t=0x56245ddefe10, arch=rr::x86, syscall_state=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/record_syscall.cc:6673
#13 0x000056245d834453 in rr::rec_process_syscall (t=0x56245ddefe10) at /cache/build/dockerized-amdci4-9/julialang/rr/src/record_syscall.cc:6693
#14 0x000056245d813d18 in rr::RecordSession::syscall_state_changed (this=0x56245dddf000, t=0x56245ddefe10, step_state=0x7ffd32ffef0c) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordSession.cc:1226
#15 0x000056245d81962a in rr::RecordSession::record_step (this=0x56245dddf000) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordSession.cc:2428
Yeah, I think the takeaway here is that we can't really to any infallible_* operations during record, since the tracee may die at any time.
We want three kinds of AutoRemoteSyscall functions:
- Infallible functions that abort on any error. These should only be called during replay.
- "Infallible" functions that abort on any error except ESRCH and just ignore ESRCH. This includes things like mmap() and close() that only have side effects in the tracee, and it doesn't really matter if they don't happen in a dead tracee.
- Fallible functions whose errors need to be handled in the caller.
I'm pondering what the naming scheme should be, in particular to distinguish the first two.
Actually I think the correct approach is to only have the latter two kinds of functions, but infallible functions abort if they get ESRCH during replay.
Functions like send_fd are a tricky case. I think we want them to return -ESRCH if the tracee died (and it's not a replay), but abort on any other error (or ESRCH in replay). I guess we can get away with calling it infallible even though it can return -ESRCH...
I'd argue that if we get -ESRCH the operation didn't really fail, instead it never happened at all :-).
Actually we have AutoRemoteSyscalls::infallible_syscall_if_alive. I like that name. We should make it abort if the tracee is dead and we're in replay, and incrementally migrate to it.
b422fed53f6c8b17299c087622111a4e4b47daba should fix the infallible_send_fd cases but there will be more. I have no idea what the replay timeout is about :-(.
With #3113, we should actually get a trace upload next time it happens. I'll queue up another bunch of builds to see if we can catch one.
https://buildkite.com/julialang/rr/builds/204#6e8010f3-cdda-4d58-898f-b0b1b79a7f89 is the only one that failed, but it failed during record in infallible_mmap.
4bd2467de3b402f1edb4a8423c3a720d8f10929b to fix infallible_mmap.
ok, that failed in replay and we do have a trace
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/lib/rr/librrpage.so changed: replay divergence likely, but continuing anyway. inode: 101038655/96736484; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 18896/18896; mtime: 0/1648257863
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[FATAL /home/roc/rr/rr/src/TraceStream.cc:1149:read_mapped_region() errno: ENOENT] Failed to stat /usr/lib/x86_64-linux-gnu/ld-2.33.so: replay is impossible
Yeah, we don't pack the traces before upload in case they're corrupt or something. In one of the other issues there are instructions to reconstitute the rootfs to get it to replay, but I can also do that when I'm back at my computer in a few hours.
FWIW the hang looks pretty disturbing.
#1 0x000055dc120c3f11 in rr::Task::wait (this=0x55dc127ce870, interrupt_after_elapsed=-1) at /cache/build/exclusive-amdci1-0/julialang/rr/src/Task.cc:1869
#2 0x000055dc120c26cc in rr::Task::resume_execution (this=0x55dc127ce870, how=rr::RESUME_SYSEMU, wait_how=rr::RESUME_WAIT, tick_period=rr::RESUME_UNLIMITED_TICKS, sig=0) at /cache/build/exclusive-amdci1-0/julialang/rr/src/Task.cc:1461
#3 0x000055dc1204de55 in rr::ReplaySession::continue_or_step (this=0x55dc127a9bb0, t=0x55dc127ce870, constraints=..., tick_request=rr::RESUME_UNLIMITED_TICKS, resume_how=rr::RESUME_SYSEMU) at /cache/build/exclusive-amdci1-0/julialang/rr/src/ReplaySession.cc:716
#4 0x000055dc1204f726 in rr::ReplaySession::emulate_async_signal (this=0x55dc127a9bb0, t=0x55dc127ce870, constraints=..., ticks=4630, in_syscallbuf_syscall_hook=...) at /cache/build/exclusive-amdci1-0/julialang/rr/src/ReplaySession.cc:1049
In one of the other issues there are instructions to reconstitute the rootfs to get it to replay
https://github.com/rr-debugger/rr/issues/3115
Ok, I can't reproduce this on my usual replay machine, which is an Intel machine with 5.6 kernel, but I can reproduce this on the CI machine, which is an AMD machine with kernel 5.4.0-51-generic (Ubuntu 18.04 LTS).
Ah, nevermind. It does actually finish, but it takes about two minutes for the usual reason that the large skid window on AMD causes excessive single stepping. On Intel it finishes in about 1.5s. I'll increase the timeout window.
Looks like we have at least one more error during record in cont_race: https://buildkite.com/organizations/julialang/pipelines/rr/builds/220/jobs/493a3c14-176b-4278-b9b3-d268d34a4286/raw_log
source_dir/src/test/util.sh: line 245: 93784 Aborted _RR_TRACE_DIR="$workdir" test-monitor $TIMEOUT record.err $RR_EXE $GLOBAL_OPTIONS record $LIB_ARG $RECORD_ARGS "$exe" $exeargs > record.out 2> record.err
--
| Test 'cont_race_32' FAILED: : error during recording:
| --------------------------------------------------
| rr: /cache/build/default-amdci4-5/julialang/rr/src/AddressSpace.cc:821: const rr::AddressSpace::Mapping& rr::AddressSpace::mapping_of(rr::remote_ptr<void>) const: Assertion `it != mem.end()' failed.
| --------------------------------------------------
| record.out:
| --------------------------------------------------
| --------------------------------------------------
| Test cont_race_32 failed, leaving behind /cache/build/default-amdci4-5/julialang/rr/obj/my_temp_parent_dir/jl_DmbP3p/rr-test-cont_race_32-733tz56Ao
| To replay the failed test, run
| _RR_TRACE_DIR=/cache/build/default-amdci4-5/julialang/rr/obj/my_temp_parent_dir/jl_DmbP3p/rr-test-cont_race_32-733tz56Ao rr replay
Unfortunately we don't have a backtrace for this kind of assertion.
One more infallible syscall that needs to be adjusted (https://buildkite.com/julialang/rr/builds/220#2eb5be6f-c07b-4f72-bae1-3d5e8cab4980):
#7 0x00005624a25f2fd6 in rr::AutoRemoteSyscalls::infallible_syscall<int> (this=0x7fff90b578d0, syscallno=6) at /cache/build/default-amdci4-4/julialang/rr/src/AutoRemoteSyscalls.h:162
--
| #8 0x00005624a27df7cc in rr::Session::create_shared_mmap (remote=..., size=1048576, map_hint=..., name=0x7fff90b57790 "syscallbuf.95098", tracee_prot=3, tracee_flags=0, monitored=...) at /cache/build/default-amdci4-4/julialang/rr/src/Session.cc:491
| #9 0x00005624a280ac00 in rr::Task::init_syscall_buffer (this=0x5624a2ff7950, remote=..., map_hint=...) at /cache/build/default-amdci4-4/julialang/rr/src/Task.cc:2664
| #10 0x00005624a27795e0 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x5624a2ff7950) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:456
| #11 0x00005624a276e473 in rr::RecordTask::init_buffers (this=0x5624a2ff7950) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:512
| #12 0x00005624a27266de in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x5624a2ff7950, syscall_state=...) at /cache/build/default-amdci4-4/julialang/rr/src/record_syscall.cc:6643
| #13 0x00005624a27106a1 in rr::rec_process_syscall_internal (t=0x5624a2ff7950, arch=rr::x86, syscall_state=...) at /cache/build/default-amdci4-4/julialang/rr/src/record_syscall.cc:6677
e91772880ce27861252f3bdec016c71a1bbe2330 should handle close() syscalls.
Unfortunately we don't have a backtrace for this kind of assertion.
Ok, b08ae8b9d2843bef8db92d5ef570a6f159adcd00 should give us backtraces for failed DEBUG_ASSERT.
But mapping_of should really return an error value instead of just dying when the mapping doesn't exist.
Here is the backtrace for that mapping_of assertion:
#5 0x00005561d71627b2 in rr::AddressSpace::mapping_of (this=0x5561d7fa5c40, addr=...) at /cache/build/default-amdci4-4/julialang/rr/src/AddressSpace.cc:821
--
| #6 0x00005561d73940dc in rr::Task::init_syscall_buffer (this=0x5561d7fadb00, remote=..., map_hint=...) at /cache/build/default-amdci4-4/julialang/rr/src/Task.cc:2662
| #7 0x00005561d7300538 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x5561d7fadb00) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:456
| #8 0x00005561d72f4f42 in rr::RecordTask::init_buffers (this=0x5561d7fadb00) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:511
| #9 0x00005561d72aca37 in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x5561d7fadb00, syscall_state=...) at /cache/build/default-amdci4-4/julialang/rr/src/record_syscall.cc:6643
9224099ad61eb6c9bd96c17fc9101c7254dfffa4 should fix that one.
cont_race timed out in https://buildkite.com/julialang/rr/builds/288#20667c72-6d23-46ca-9dee-d9bf5b79768e
PR #3512 could be related to this issue and show some more details.