Unrecorded SIGSEGV during replay
While attempting to replay trace julia-3 in https://julialang-dumps.s3.amazonaws.com/linux64/3824/rr-run_3824-gitsha_da727d3563-2021-09-26_10_41_36.tar.zst, encountered a crash in rr at global_time 27:
[FATAL /workspace/srcdir/rr/src/ReplaySession.cc:491:cont_syscall_boundary()]
(task 21565 (rec:12113) at time 27)
-> Assertion `false' failed to hold. Replay got unrecorded signal {signo:SIGSEGV,errno:SUCCESS,code:SEGV_MAPERR,addr:0x7fe971f79048}
Tail of trace dump:
[snip] ...
{
real_time:69877284.393712 global_time:27, event:`SYSCALL: brk' (state:ENTERING_SYSCALL) tid:12113, ticks:189
rax:0xffffffffffffffda rbx:0x400040 rcx:0xffffffffffffffff rdx:0x0 rsi:0x4007e0 rdi:0x0 rbp:0x7fe971d587c0 rsp:0x7ffe547dd468 r8:0x6fffd000 r9:0x37f r10:0x64 r11:0x246 r12:0xa r13:0x7ffe547dd7e9 r14:0x0 r15:0x1000 rip:0x7fe971d6cb8a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xc fs_base:0x0 gs_base:0x0
}
{
real_time:69877284.393746 global_time:28, event:`SYSCALL: brk' (state:EXITING_SYSCALL) tid:12113, ticks:189
rax:0x14df000 rbx:0x400040 rcx:0xffffffffffffffff rdx:0x0 rsi:0x4007e0 rdi:0x0 rbp:0x7fe971d587c0 rsp:0x7ffe547dd468 r8:0x6fffd000 r9:0x37f r10:0x64 r11:0x246 r12:0xa r13:0x7ffe547dd7e9 r14:0x0 r15:0x1000 rip:0x7fe971d6cb8a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xc fs_base:0x0 gs_base:0x0
{ map_file:"<ZERO>", addr:0x14df000, length:(nil), prot_flags:"---p", file_offset:0x0, device:0, inode:0, data_file:"", data_offset:0x0, file_size:0x0 }
}
Replays fine with rr replay -a, so we believe the issue is in the replayer rather than in the trace itself.
CC: @Keno
Replays fine with rr replay -a, so we believe the issue is in the replayer rather than in the trace itself.
If replay -a works, can you be more specific about what exact commands do not work?
We got this error just from running rr replay . in the trace directory
Even before gdb starts?
What rev of rr are you using?
Sorry, not before GDB starts but after the first c. This is on rr 5.3.0. Here's more of the rr crash trace:
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ianatol/rr_debugging/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...done.
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:35655
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.27.so...done.
done.
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
The trace from the OP follows immediately after.
You should upgrade your rr. I can't reproduce this on 5.5.
Sure thing, thanks for looking into this!
I thought we'd checked master, and indeed I just tried master again and still see the same issue:
keno@antarctic:~/rrbugwat/rr_traces/julia-3$ ~/rr-build//bin/rr --version
rr version 5.5.0
keno@antarctic:~/rrbugwat/rr_traces/julia-3$ ~/rr-build//bin/rr replay .
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/keno/rrbugwat/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...done.
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:39996
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.27.so...done.
done.
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
[FATAL /home/keno/rr/src/ReplaySession.cc:513:cont_syscall_boundary()]
(task 39997 (rec:12113) at time 27)
-> Assertion `false' failed to hold. Replay got unrecorded signal {signo:SIGSEGV,errno:SUCCESS,code:SEGV_MAPERR,addr:0x7fe971f79048}
Tail of trace dump:
Ok ... maybe you need to update gdb then. It's working here.
khuey@minbar:~/dev/scratch/trace3/rr_traces/julia-3$ ~/dev/obj-rr/bin/rr --version
rr version 5.5.0
khuey@minbar:~/dev/scratch/trace3/rr_traces/julia-3$ ~/dev/obj-rr/bin/rr replay .
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/khuey/dev/scratch/trace3/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:52678
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
julia_worker:9113#127.0.0.1
^C[New Thread 12113.12147]
Thread 1 stopped.
0x00007fe971f646d8 in ?? ()
(rr) when
Current event: 5000
Fails with 10.1 also for me:
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/keno/rrbugwat/rr_traces/julia-3/mmap_symlink_8_mmap_pack_20_julia...
Remote debugging using 127.0.0.1:2000
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file
0x00007fe971d56b30 in ?? () from /lib64/ld-linux-x86-64.so.2
(gdb) c
Continuing.
[FATAL /home/keno/rr/src/ReplaySession.cc:513:cont_syscall_boundary()]
(task 602 (rec:12113) at time 27)
-> Assertion `false' failed to hold. Replay got unrecorded signal {signo:SIGSEGV,errno:SUCCESS,code:SEGV_MAPERR,addr:0x7fe971f79048}
Ah, it works fine with --serve-files (which we need to use here anyway, but forgot in the initial replay). I guess what happened is that GDB loaded the ld.so symbols from the local file system. Since that ld.so differs from the one in the trace, GDB got an incorrect location for one of its breakpoints and ended up twiddling the bytes in the middle of the mov instruction (and on your system it differed yet again, in a way that didn't cause a crash). Not sure there's much we can do here, except maybe try to validate whether the ld.so is the same between the trace and your local system if --serve-files isn't passed?
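To make the proposed validation concrete, here is a minimal sketch of the comparison, assuming you already know the path of the ld.so copy saved in the trace directory and the path of the local one. Both paths and the function names (file_digest, same_binary) are hypothetical, and this is not how rr does or would implement the check; it only shows comparing the two files by content hash.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(trace_copy, local_copy):
    """Return True only if both files exist and have identical contents.

    trace_copy and local_copy are hypothetical paths supplied by the caller:
    the ld.so saved with the trace and the ld.so GDB would load locally.
    """
    if not Path(trace_copy).is_file() or not Path(local_copy).is_file():
        return False
    return file_digest(trace_copy) == file_digest(local_copy)
```

If same_binary returns False for the two ld.so files, symbols that GDB loads from the local copy may not match the code in the trace, and replaying with --serve-files is the safer choice.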
Oof, that's a nasty footgun.
Not sure there's much we can do here, except maybe try to validate whether the ld.so is the same between the trace and your local system if --serve-files isn't passed?
That sounds like a good idea. Is this something rr replay can actually check?
Should a hack similar to f900475b5cf52524a411a342063d14226ff0e998 be applied here?
Should a hack similar to f900475 be applied here?
If I understand this issue correctly, the problem is that gdb picks up the wrong ld.so, so I don't see how doing something like the linked commit in rr would solve that.