Unrecorded SIGSEGV during replay

Open ianatol opened this issue 4 years ago • 14 comments

While attempting to replay trace julia-3 in https://julialang-dumps.s3.amazonaws.com/linux64/3824/rr-run_3824-gitsha_da727d3563-2021-09-26_10_41_36.tar.zst, we encountered a crash in rr at global_time 27:

[FATAL /workspace/srcdir/rr/src/ReplaySession.cc:491:cont_syscall_boundary()]
 (task 21565 (rec:12113) at time 27)
 -> Assertion `false' failed to hold. Replay got unrecorded signal {signo:SIGSEGV,errno:SUCCESS,code:SEGV_MAPERR,addr:0x7fe971f79048}
Tail of trace dump:
[snip] ...
{
  real_time:69877284.393712 global_time:27, event:`SYSCALL: brk' (state:ENTERING_SYSCALL) tid:12113, ticks:189
rax:0xffffffffffffffda rbx:0x400040 rcx:0xffffffffffffffff rdx:0x0 rsi:0x4007e0 rdi:0x0 rbp:0x7fe971d587c0 rsp:0x7ffe547dd468 r8:0x6fffd000 r9:0x37f r10:0x64 r11:0x246 r12:0xa r13:0x7ffe547dd7e9 r14:0x0 r15:0x1000 rip:0x7fe971d6cb8a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xc fs_base:0x0 gs_base:0x0
}
{
  real_time:69877284.393746 global_time:28, event:`SYSCALL: brk' (state:EXITING_SYSCALL) tid:12113, ticks:189
rax:0x14df000 rbx:0x400040 rcx:0xffffffffffffffff rdx:0x0 rsi:0x4007e0 rdi:0x0 rbp:0x7fe971d587c0 rsp:0x7ffe547dd468 r8:0x6fffd000 r9:0x37f r10:0x64 r11:0x246 r12:0xa r13:0x7ffe547dd7e9 r14:0x0 r15:0x1000 rip:0x7fe971d6cb8a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xc fs_base:0x0 gs_base:0x0
  { map_file:"<ZERO>", addr:0x14df000, length:(nil), prot_flags:"---p", file_offset:0x0, device:0, inode:0, data_file:"", data_offset:0x0, file_size:0x0 }
}

Replays fine with replay -a, so we believe there is an issue with the replayer rather than the trace itself

CC: @Keno

ianatol avatar Sep 28 '21 17:09 ianatol

Replays fine with replay -a, so we believe there is an issue with the replayer rather than the trace itself

If replay -a works, can you be more specific about what exact commands do not work?

khuey avatar Sep 28 '21 17:09 khuey

We got this error just from running rr replay . in the trace directory

ianatol avatar Sep 28 '21 18:09 ianatol

Even before gdb starts?

What rev of rr are you using?

khuey avatar Sep 28 '21 18:09 khuey

Sorry, not before GDB starts but after the first c. This is on rr 5.3.0. Here's more of the rr crash trace:

GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ianatol/rr_debugging/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...done.
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:35655
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.27.so...done.
done.
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.

The trace from the OP follows immediately after.

ianatol avatar Sep 28 '21 19:09 ianatol

You should upgrade your rr. I can't reproduce this on 5.5.

khuey avatar Sep 28 '21 19:09 khuey

Sure thing, thanks for looking into this!

ianatol avatar Sep 28 '21 20:09 ianatol

I thought we'd checked master and indeed I just tried master again and see the same issue:

keno@antarctic:~/rrbugwat/rr_traces/julia-3$ ~/rr-build//bin/rr --version
rr version 5.5.0
keno@antarctic:~/rrbugwat/rr_traces/julia-3$ ~/rr-build//bin/rr replay .
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/keno/rrbugwat/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...done.
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:39996
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.27.so...done.
done.
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
[FATAL /home/keno/rr/src/ReplaySession.cc:513:cont_syscall_boundary()]
 (task 39997 (rec:12113) at time 27)
 -> Assertion `false' failed to hold. Replay got unrecorded signal {signo:SIGSEGV,errno:SUCCESS,code:SEGV_MAPERR,addr:0x7fe971f79048}
Tail of trace dump:

Keno avatar Sep 28 '21 22:09 Keno

Ok ... maybe you need to update gdb then. It's working here.

khuey@minbar:~/dev/scratch/trace3/rr_traces/julia-3$ ~/dev/obj-rr/bin/rr --version
rr version 5.5.0
khuey@minbar:~/dev/scratch/trace3/rr_traces/julia-3$ ~/dev/obj-rr/bin/rr replay .
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/khuey/dev/scratch/trace3/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:52678
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
julia_worker:9113#127.0.0.1
^C[New Thread 12113.12147]

Thread 1 stopped.
0x00007fe971f646d8 in ?? ()
(rr) when
Current event: 5000

khuey avatar Sep 28 '21 22:09 khuey

Fails with 10.1 also for me:

GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/keno/rrbugwat/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...
Remote debugging using 127.0.0.1:2000
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(gdb) c
Continuing.
[FATAL /home/keno/rr/src/ReplaySession.cc:513:cont_syscall_boundary()]
 (task 602 (rec:12113) at time 27)
 -> Assertion `false' failed to hold. Replay got unrecorded signal {signo:SIGSEGV,errno:SUCCESS,code:SEGV_MAPERR,addr:0x7fe971f79048}

Keno avatar Sep 28 '21 23:09 Keno

Ah, it works fine with --serve-files (which we need to use here anyway, but forgot in the initial replay). I guess what happened is that GDB loaded the ld.so symbols from the local file system, but since that ld.so is different from the one in the trace, GDB got an incorrect location for one of its breakpoints and ended up twiddling the bytes in the middle of the mov instruction. (It was different yet again on your system, in such a way that it didn't cause a crash.) Not sure there's much we can do here, except maybe try to validate whether the ld.so is the same between the trace and your local system if --serve-files isn't passed?
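
To make the suspected mechanism concrete, here is a minimal Python sketch. It is an illustration only and not taken from the trace: the instruction bytes are a made-up `mov rax, [rip+0x10]`, and the helper names are hypothetical. On x86, a debugger plants a software breakpoint by overwriting one byte with `0xCC` (int3). At an instruction boundary that traps as intended; in the middle of an instruction it silently changes an operand, so the mov would access a different address.

```python
import struct

INT3 = 0xCC  # x86 software breakpoint opcode


def insert_breakpoint(code: bytes, offset: int) -> bytes:
    """Return a copy of `code` with an int3 byte written at `offset`."""
    patched = bytearray(code)
    patched[offset] = INT3
    return bytes(patched)


def rip_relative_disp(insn: bytes) -> int:
    """Decode the disp32 of a `mov rax, [rip+disp32]` instruction."""
    # REX.W prefix (0x48), opcode 0x8B, ModRM 0x05 (RIP-relative), then disp32.
    assert insn[:3] == b"\x48\x8b\x05"
    return struct.unpack("<i", insn[3:7])[0]


# Hypothetical instruction: mov rax, [rip+0x10]
MOV = bytes.fromhex("488b0510000000")

# Breakpoint at the instruction boundary: the first byte becomes int3.
at_boundary = insert_breakpoint(MOV, 0)

# Breakpoint three bytes in (wrong symbol offsets): the opcode is intact,
# but the low byte of the displacement is overwritten.
mid_instruction = insert_breakpoint(MOV, 3)

print(hex(rip_relative_disp(MOV)))
print(hex(rip_relative_disp(mid_instruction)))
```

In the second case the instruction still decodes as a mov, but with a different displacement, which is consistent with a SEGV_MAPERR at an address the recording never touched.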

Keno avatar Sep 28 '21 23:09 Keno

Oof, that's a nasty footgun.

khuey avatar Sep 29 '21 00:09 khuey

Not sure there's much we can do here, except maybe try to validate whether the ld.so is the same between the trace and your local system if --serve-files isn't passed?

That sounds like a good idea. Is this something rr replay actually can check?
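
Outside of rr, the comparison itself can be sketched in a few lines of Python. This is a manual check under stated assumptions, not something rr replay does: the function names are hypothetical, and the caller has to supply the path of the ld.so copy stored in the trace, since the thread does not say what that file is called.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(local_path, trace_copy_path):
    """Return True if the local file and the trace's copy are byte-identical."""
    return sha256_of(local_path) == sha256_of(trace_copy_path)
```

For example, `same_contents("/lib64/ld-linux-x86-64.so.2", trace_ld_so)` returning False would suggest that symbols loaded from the local file do not match the recorded process, where `trace_ld_so` is a placeholder for wherever the trace keeps its copy.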

GitMensch avatar Oct 29 '21 09:10 GitMensch

Should a hack similar to f900475b5cf52524a411a342063d14226ff0e998 be applied here?

GitMensch avatar Nov 12 '21 18:11 GitMensch

Should a hack similar to f900475 be applied here?

If I understand this issue correctly, the problem is that gdb picks up the wrong ld.so, so I don't see how we would solve that by doing something like the linked commit in rr.

khuey avatar Nov 12 '21 19:11 khuey