rr hangs on RHEL7
When running rr record on RHEL7, rr seems to just hang.
Commit 29a59e412ad3e6600fcb37e9a5017f6eafce3b3d introduced this problem. It seems to be getting stuck in the waitid call here. Below is the backtrace from gdb right before it hangs.
(gdb) backtrace
#0 rr::Task::wait_exit (this=0xcd38a0) at /root/rr/src/Task.cc:161
#1 0x00000000008a3d0f in rr::Task::proceed_to_exit (this=0xcd38a0, wait=true) at /root/rr/src/Task.cc:184
#2 0x00000000007b6c9d in rr::handle_ptrace_exit_event (t=0xcd38a0) at /root/rr/src/RecordSession.cc:228
#3 0x00000000007bfbb8 in rr::RecordSession::record_step (this=0xcd1520) at /root/rr/src/RecordSession.cc:2321
#4 0x00000000007b3283 in rr::record (args=std::vector of length 1, capacity 2 = {...}, flags=...)
at /root/rr/src/RecordCommand.cc:649
#5 0x00000000007b3d5b in rr::RecordCommand::run (this=0xccbeb0 <rr::RecordCommand::singleton>,
args=std::vector of length 1, capacity 2 = {...}) at /root/rr/src/RecordCommand.cc:792
#6 0x00000000008fa38b in main (argc=3, argv=0x7fffffffe108) at /root/rr/src/main.cc:268
Does this happen on every program?
What kernel version?
$ uname -r
3.10.0-1127.el7.x86_64
I've tried testing rr on a simple C program and a few other command line utilities, and it appears to hang in all instances. The testsuite also seems to be suffering from the same problem, as I'm seeing timeouts.
I've never gotten rr to build properly on RHEL7 with all tests passing. Is there a different set of build instructions for that? Thanks, Jim
On Aug 20, 2020, at 11:57 AM, Sagar Patel wrote:
When running rr record on RHEL7, rr seems to just hang.
Commit 29a59e4 introduced this problem. It seems to be getting stuck in the waitid call here. Below is the backtrace from GDB right before it hangs.
(gdb) backtrace
#0 rr::Task::wait_exit (this=0xcd38a0) at /root/rr/src/Task.cc:161
#1 0x00000000008a3d0f in rr::Task::proceed_to_exit (this=0xcd38a0, wait=true) at /root/rr/src/Task.cc:184
#2 0x00000000007b6c9d in rr::handle_ptrace_exit_event (t=0xcd38a0) at /root/rr/src/RecordSession.cc:228
#3 0x00000000007bfbb8 in rr::RecordSession::record_step (this=0xcd1520) at /root/rr/src/RecordSession.cc:2321
#4 0x00000000007b3283 in rr::record (args=std::vector of length 1, capacity 2 = {...}, flags=...) at /root/rr/src/RecordCommand.cc:649
#5 0x00000000007b3d5b in rr::RecordCommand::run (this=0xccbeb0 <rr::RecordCommand::singleton>, args=std::vector of length 1, capacity 2 = {...}) at /root/rr/src/RecordCommand.cc:792
#6 0x00000000008fa38b in main (argc=3, argv=0x7fffffffe108) at /root/rr/src/main.cc:268
Are you able to run rr to some extent on RHEL7? If so, what's the kernel version?
I'm building rr from source (along with this patch).
Hey @Keno, do you have an idea as to what may be causing this issue?
This stuff is pretty sensitive to the kernel getting wait notifications right, which definitely had problems at some point in the past. What the specific issue is though, I don't know.
Possibly related to #2646.
Probably not, if you see this on "a simple C program". #2646 requires PID namespaces.
Ah I see, thanks for clarifying.
Additional information (strace results) that may help in diagnosing this issue.
RHEL7 results (right before it hangs):
...
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10161, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
waitid(P_PID, 10161, 0x7ffebc67a9e0, WSTOPPED|WNOWAIT, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
Fedora 32 results (no hanging):
...
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19639, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
waitid(P_PID, 19639, 0x7ffe044162f0, WSTOPPED|WNOWAIT, NULL) = -1 ECHILD (No child processes)
waitid(P_PID, 19639, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19639, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WNOHANG|WEXITED, NULL) = 0
...
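The two traces differ in what waitid does once the child has already exited: on RHEL7 the WSTOPPED|WNOWAIT call appears to block, while on Fedora 32 it fails with ECHILD and the follow-up WNOHANG|WEXITED call collects the exit status. As a rough illustration of that second, non-blocking style of call (this is not rr's code and not a proposed fix), here is a minimal Python sketch. The helper names wait_for_exit_nonblocking and demo are hypothetical, and the timeout values are arbitrary assumptions.

```python
import os
import time


def wait_for_exit_nonblocking(pid, timeout=5.0, poll_interval=0.01):
    """Poll for the exit of child `pid` without blocking indefinitely.

    Hypothetical helper for illustration only. Returns the child's exit
    status, or None if the child did not exit within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # WEXITED asks for exit events; WNOHANG makes waitid return at once
        # (with None) when the child is not yet in a waitable state.
        info = os.waitid(os.P_PID, pid, os.WEXITED | os.WNOHANG)
        if info is not None:
            # For a normal exit (CLD_EXITED), si_status is the exit code.
            return info.si_status
        time.sleep(poll_interval)
    return None


def demo(exit_code=0):
    """Fork a child that exits immediately and collect its exit status."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running cleanup handlers
    return wait_for_exit_nonblocking(pid)


if __name__ == "__main__":
    print(demo(0))
```

Because every call passes WNOHANG, the loop cannot get stuck inside waitid itself; the worst case is that it gives up after the timeout.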
I got the same hang on RHEL7 with rr built from the 5.3.0 source code. If I don't build it myself and use rr-5.3.0-Linux-x86_64.tar.gz directly, there is no hang.
Same here with the 3.10.0-1127.el7.x86_64 kernel from CentOS Linux release 7.8.2003 (Core)
:-(
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24227, si_uid=12427, si_status=0, si_utime=2, si_stime=3} ---
waitid(P_PID, 24227,
and once got
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27927, si_uid=12427, si_status=0, si_utime=3, si_stime=5} ---
waitid(P_PID, 27927, ^C0x7ffe59af4cf0, WSTOPPED|WNOWAIT, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
Looks like a repo install:
/usr/bin/rr --version
rr version 5.4.0
Any hints on how to debug and/or fix that? It happens both with the system-provided GDB and with GDB 9.2 built from source.
Realistically, we're not going to debug stuff on an 8-year-old kernel, and waitpid hanging immediately after the SIGCHLD like that sure looks like a kernel bug.
:-/
Other than RHEL8 (where the initial tests were run), RHEL7/CentOS7 is an LTS version (and is therefore quite common in corporate environments), so it would be really useful to be able to use rr there.
I think I'll try to get someone to build master on this machine or a VM copy on Monday and recheck, even though my hope that it will work has decreased a bit.
I'm also experiencing this issue with RHEL7.
-bash-4.2$ rpm -q rr
rr-5.4.0-1.el7.x86_64
-bash-4.2$ uname -r
3.10.0-1160.90.1.el7.x86_64
-bash-4.2$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)
Is there a known working version for this kernel?
That old version of the kernel is unsupported. That said, using the changes in the referenced pull request above fixed the issue for me.