rr icon indicating copy to clipboard operation
rr copied to clipboard

CI: Intermittent failure in cont_race

Open Keno opened this issue 3 years ago • 33 comments

E.g.

  • https://buildkite.com/julialang/rr/builds/191#eab915ff-919f-4167-98c2-dc592a739804
  • https://buildkite.com/julialang/rr/builds/197#72075042-cf0c-4ea5-bdaf-627e7d37566a
  • https://buildkite.com/julialang/rr/builds/194#505a91bf-d491-41b1-98df-c26f46887da2

Keno avatar Mar 24 '22 21:03 Keno

There appear to be two bugs here:

  • timeout during replay
  • crash during recording with
[FATAL /cache/build/dockerized-amdci4-8/julialang/rr/src/AutoRemoteSyscalls.cc:567:infallible_send_fd()]
-> Assertion `child_fd >= 0' failed to hold. Failed to send fd; err=ESRCH

This is testing async SIGKILLs during recording so it's not that surprising that we might trigger multiple bugs.

rocallahan avatar Mar 25 '22 08:03 rocallahan

For the crash, the rr stack is

#5  0x000056331c79a517 in rr::EmergencyDebugOstream::~EmergencyDebugOstream (this=0x7ffc8580e9e0, __in_chrg=<optimized out>) at /cache/build/dockerized-amdci4-8/julialang/rr/src/log.cc:451
#6  0x000056331c6f7dbe in rr::AutoRemoteSyscalls::infallible_send_fd (this=0x7ffc8580ead0, our_fd=...) at /cache/build/dockerized-amdci4-8/julialang/rr/src/AutoRemoteSyscalls.cc:567
#7  0x000056331c864523 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x56331d340a10) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordTask.cc:473
#8  0x000056331c8590d7 in rr::RecordTask::init_buffers (this=0x56331d340a10) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordTask.cc:510
#9  0x000056331c811343 in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x56331d340a10, syscall_state=...) at /cache/build/dockerized-amdci4-8/julialang/rr/src/record_syscall.cc:6639
#10 0x000056331c7fb306 in rr::rec_process_syscall_internal (t=0x56331d340a10, arch=rr::x86, syscall_state=...) at /cache/build/dockerized-amdci4-8/julialang/rr/src/record_syscall.cc:6673
#11 0x000056331c7fb453 in rr::rec_process_syscall (t=0x56331d340a10) at /cache/build/dockerized-amdci4-8/julialang/rr/src/record_syscall.cc:6693
#12 0x000056331c7dad18 in rr::RecordSession::syscall_state_changed (this=0x56331d330000, t=0x56331d340a10, step_state=0x7ffc8580ff6c) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordSession.cc:1226
#13 0x000056331c7e062a in rr::RecordSession::record_step (this=0x56331d330000) at /cache/build/dockerized-amdci4-8/julialang/rr/src/RecordSession.cc:2428

rocallahan avatar Mar 25 '22 08:03 rocallahan

This stack occurs in both 194 and 197.

rocallahan avatar Mar 25 '22 08:03 rocallahan

Actually no, 197 is a different but similar-looking stack:

#5  0x000056245d7d3517 in rr::EmergencyDebugOstream::~EmergencyDebugOstream (this=0x7ffd32ffc6f0, __in_chrg=<optimized out>) at /cache/build/dockerized-amdci4-9/julialang/rr/src/log.cc:451
#6  0x000056245d730dbe in rr::AutoRemoteSyscalls::infallible_send_fd (this=0x7ffd32ffda70, our_fd=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/AutoRemoteSyscalls.cc:567
#7  0x000056245d902f6c in rr::Session::create_shared_mmap (remote=..., size=1048576, map_hint=..., name=0x7ffd32ffd930 "syscallbuf.47894", tracee_prot=3, tracee_flags=0, monitored=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/Session.cc:443
#8  0x000056245d92e6aa in rr::Task::init_syscall_buffer (this=0x56245ddefe10, remote=..., map_hint=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/Task.cc:2664
#9  0x000056245d89d244 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x56245ddefe10) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordTask.cc:456
#10 0x000056245d8920d7 in rr::RecordTask::init_buffers (this=0x56245ddefe10) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordTask.cc:510
#11 0x000056245d84a343 in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x56245ddefe10, syscall_state=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/record_syscall.cc:6639
#12 0x000056245d834306 in rr::rec_process_syscall_internal (t=0x56245ddefe10, arch=rr::x86, syscall_state=...) at /cache/build/dockerized-amdci4-9/julialang/rr/src/record_syscall.cc:6673
#13 0x000056245d834453 in rr::rec_process_syscall (t=0x56245ddefe10) at /cache/build/dockerized-amdci4-9/julialang/rr/src/record_syscall.cc:6693
#14 0x000056245d813d18 in rr::RecordSession::syscall_state_changed (this=0x56245dddf000, t=0x56245ddefe10, step_state=0x7ffd32ffef0c) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordSession.cc:1226
#15 0x000056245d81962a in rr::RecordSession::record_step (this=0x56245dddf000) at /cache/build/dockerized-amdci4-9/julialang/rr/src/RecordSession.cc:2428

rocallahan avatar Mar 25 '22 08:03 rocallahan

Yeah, I think the takeaway here is that we can't really to any infallible_* operations during record, since the tracee may die at any time.

Keno avatar Mar 25 '22 16:03 Keno

We want three kinds of AutoRemoteSyscall functions:

  • Infallible functions that abort on any error. These should only be called during replay.
  • "Infallible" functions that abort on any error except ESRCH and just ignore ESRCH. This includes things like mmap() and close() that only have side effects in the tracee, and it doesn't really matter if they don't happen in a dead tracee.
  • Fallible functions whose errors need to be handled in the caller.

I'm pondering what the naming scheme should be, in particular to distinguish the first two.

rocallahan avatar Mar 25 '22 19:03 rocallahan

Actually I think the correct approach is to only have the latter two kinds of functions, but infallible functions abort if they get ESRCH during replay.

rocallahan avatar Mar 25 '22 19:03 rocallahan

Functions like send_fd are a tricky case. I think we want them to return -ESRCH if the tracee died (and it's not a replay), but abort on any other error (or ESRCH in replay). I guess we can get away with calling it infallible even though it can return -ESRCH...

rocallahan avatar Mar 25 '22 19:03 rocallahan

I'd argue that if we get -ESRCH the operation didn't really fail, instead it never happened at all :-).

rocallahan avatar Mar 25 '22 19:03 rocallahan

Actually we have AutoRemoteSyscalls::infallible_syscall_if_alive. I like that name. We should make it abort if the tracee is dead and we're in replay, and incrementally migrate to it.

rocallahan avatar Mar 25 '22 22:03 rocallahan

b422fed53f6c8b17299c087622111a4e4b47daba should fix the infallible_send_fd cases but there will be more. I have no idea what the replay timeout is about :-(.

rocallahan avatar Mar 26 '22 00:03 rocallahan

With #3113, we should actually get a trace upload next time it happens. I'll queue up another bunch of builds to see if we can catch one.

Keno avatar Mar 26 '22 00:03 Keno

https://buildkite.com/julialang/rr/builds/204#6e8010f3-cdda-4d58-898f-b0b1b79a7f89 is the only one that failed, but it failed during record in infallible_mmap.

Keno avatar Mar 26 '22 00:03 Keno

4bd2467de3b402f1edb4a8423c3a720d8f10929b to fix infallible_mmap.

rocallahan avatar Mar 26 '22 01:03 rocallahan

ok, that failed in replay and we do have a trace

Keno avatar Mar 26 '22 01:03 Keno

[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/lib/rr/librrpage.so changed: replay divergence likely, but continuing anyway. inode: 101038655/96736484; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 18896/18896; mtime: 0/1648257863
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[ERROR /home/roc/rr/rr/src/TraceStream.cc:1156:read_mapped_region()] Metadata of /cache/build/exclusive-amdci1-0/julialang/rr/obj/my_temp_parent_dir/jl_2NFyWr/rr-test-cont_race-CsXf88Rlk/cont_race-CsXf88Rlk changed: replay divergence likely, but continuing anyway. inode: 101038656/107220224; mode: 33261/33261; uid: 1000/0; gid: 1000/0; size: 698640/698640; mtime: 0/1648257893
[FATAL /home/roc/rr/rr/src/TraceStream.cc:1149:read_mapped_region() errno: ENOENT] Failed to stat /usr/lib/x86_64-linux-gnu/ld-2.33.so: replay is impossible

rocallahan avatar Mar 26 '22 01:03 rocallahan

Yeah, we don't pack the traces before upload in case they're corrupt or something. In one of the other issues there are instructions to reconstitute the rootfs to get it to replay, but I can also do that when I'm back at my computer in a few hours.

Keno avatar Mar 26 '22 01:03 Keno

FWIW the hang looks pretty disturbing.

#1  0x000055dc120c3f11 in rr::Task::wait (this=0x55dc127ce870, interrupt_after_elapsed=-1) at /cache/build/exclusive-amdci1-0/julialang/rr/src/Task.cc:1869
#2  0x000055dc120c26cc in rr::Task::resume_execution (this=0x55dc127ce870, how=rr::RESUME_SYSEMU, wait_how=rr::RESUME_WAIT, tick_period=rr::RESUME_UNLIMITED_TICKS, sig=0) at /cache/build/exclusive-amdci1-0/julialang/rr/src/Task.cc:1461
#3  0x000055dc1204de55 in rr::ReplaySession::continue_or_step (this=0x55dc127a9bb0, t=0x55dc127ce870, constraints=..., tick_request=rr::RESUME_UNLIMITED_TICKS, resume_how=rr::RESUME_SYSEMU) at /cache/build/exclusive-amdci1-0/julialang/rr/src/ReplaySession.cc:716
#4  0x000055dc1204f726 in rr::ReplaySession::emulate_async_signal (this=0x55dc127a9bb0, t=0x55dc127ce870, constraints=..., ticks=4630, in_syscallbuf_syscall_hook=...) at /cache/build/exclusive-amdci1-0/julialang/rr/src/ReplaySession.cc:1049

rocallahan avatar Mar 26 '22 02:03 rocallahan

In one of the other issues there are instructions to reconstitute the rootfs to get it to replay

https://github.com/rr-debugger/rr/issues/3115

DilumAluthge avatar Mar 26 '22 02:03 DilumAluthge

Ok, I can't reproduce this on my usual replay machine, which is an Intel machine with 5.6 kernel, but I can reproduce this on the CI machine, which is an AMD machine with kernel 5.4.0-51-generic (Ubuntu 18.04 LTS).

Keno avatar Mar 26 '22 23:03 Keno

Ah, nevermind. It does actually finish, but it takes about two minutes for the usual reason that the large skid window on AMD causes excessive single stepping. On Intel it finishes in about 1.5s. I'll increase the timeout window.

Keno avatar Mar 26 '22 23:03 Keno

Looks like we have at least one more error during record in cont_race: https://buildkite.com/organizations/julialang/pipelines/rr/builds/220/jobs/493a3c14-176b-4278-b9b3-d268d34a4286/raw_log



source_dir/src/test/util.sh: line 245: 93784 Aborted                 _RR_TRACE_DIR="$workdir" test-monitor $TIMEOUT record.err $RR_EXE $GLOBAL_OPTIONS record $LIB_ARG $RECORD_ARGS "$exe" $exeargs > record.out 2> record.err
--
  | Test 'cont_race_32' FAILED: : error during recording:
  | --------------------------------------------------
  | rr: /cache/build/default-amdci4-5/julialang/rr/src/AddressSpace.cc:821: const rr::AddressSpace::Mapping& rr::AddressSpace::mapping_of(rr::remote_ptr<void>) const: Assertion `it != mem.end()' failed.
  | --------------------------------------------------
  | record.out:
  | --------------------------------------------------
  | --------------------------------------------------
  | Test cont_race_32 failed, leaving behind /cache/build/default-amdci4-5/julialang/rr/obj/my_temp_parent_dir/jl_DmbP3p/rr-test-cont_race_32-733tz56Ao
  | To replay the failed test, run
  | _RR_TRACE_DIR=/cache/build/default-amdci4-5/julialang/rr/obj/my_temp_parent_dir/jl_DmbP3p/rr-test-cont_race_32-733tz56Ao rr replay

Unfortunately we don't have a backtrace for this kind of assertion.

Keno avatar Mar 27 '22 00:03 Keno

One more infallible syscall that needs to be adjusted (https://buildkite.com/julialang/rr/builds/220#2eb5be6f-c07b-4f72-bae1-3d5e8cab4980):



#7  0x00005624a25f2fd6 in rr::AutoRemoteSyscalls::infallible_syscall<int> (this=0x7fff90b578d0, syscallno=6) at /cache/build/default-amdci4-4/julialang/rr/src/AutoRemoteSyscalls.h:162
--
  | #8  0x00005624a27df7cc in rr::Session::create_shared_mmap (remote=..., size=1048576, map_hint=..., name=0x7fff90b57790 "syscallbuf.95098", tracee_prot=3, tracee_flags=0, monitored=...) at /cache/build/default-amdci4-4/julialang/rr/src/Session.cc:491
  | #9  0x00005624a280ac00 in rr::Task::init_syscall_buffer (this=0x5624a2ff7950, remote=..., map_hint=...) at /cache/build/default-amdci4-4/julialang/rr/src/Task.cc:2664
  | #10 0x00005624a27795e0 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x5624a2ff7950) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:456
  | #11 0x00005624a276e473 in rr::RecordTask::init_buffers (this=0x5624a2ff7950) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:512
  | #12 0x00005624a27266de in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x5624a2ff7950, syscall_state=...) at /cache/build/default-amdci4-4/julialang/rr/src/record_syscall.cc:6643
  | #13 0x00005624a27106a1 in rr::rec_process_syscall_internal (t=0x5624a2ff7950, arch=rr::x86, syscall_state=...) at /cache/build/default-amdci4-4/julialang/rr/src/record_syscall.cc:6677


Keno avatar Mar 27 '22 17:03 Keno

e91772880ce27861252f3bdec016c71a1bbe2330 should handle close() syscalls.

rocallahan avatar Mar 27 '22 22:03 rocallahan

Unfortunately we don't have a backtrace for this kind of assertion.

Ok, b08ae8b9d2843bef8db92d5ef570a6f159adcd00 should give us backtraces for failed DEBUG_ASSERT.

rocallahan avatar Mar 27 '22 23:03 rocallahan

But mapping_of should really return an error value instead of just dying when the mapping doesn't exist.

rocallahan avatar Mar 27 '22 23:03 rocallahan

Here is the backtrace for that mapping_of assertion:



#5  0x00005561d71627b2 in rr::AddressSpace::mapping_of (this=0x5561d7fa5c40, addr=...) at /cache/build/default-amdci4-4/julialang/rr/src/AddressSpace.cc:821
--
  | #6  0x00005561d73940dc in rr::Task::init_syscall_buffer (this=0x5561d7fadb00, remote=..., map_hint=...) at /cache/build/default-amdci4-4/julialang/rr/src/Task.cc:2662
  | #7  0x00005561d7300538 in rr::RecordTask::init_buffers_arch<rr::X86Arch> (this=0x5561d7fadb00) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:456
  | #8  0x00005561d72f4f42 in rr::RecordTask::init_buffers (this=0x5561d7fadb00) at /cache/build/default-amdci4-4/julialang/rr/src/RecordTask.cc:511
  | #9  0x00005561d72aca37 in rr::rec_process_syscall_arch<rr::X86Arch> (t=0x5561d7fadb00, syscall_state=...) at /cache/build/default-amdci4-4/julialang/rr/src/record_syscall.cc:6643


Keno avatar Mar 28 '22 19:03 Keno

9224099ad61eb6c9bd96c17fc9101c7254dfffa4 should fix that one.

rocallahan avatar Mar 29 '22 08:03 rocallahan

cont_race timed out in https://buildkite.com/julialang/rr/builds/288#20667c72-6d23-46ca-9dee-d9bf5b79768e

Keno avatar Apr 14 '22 03:04 Keno

PR #3512 could be related to this issue and show some more details.

bernhardu avatar Apr 29 '23 13:04 bernhardu