
rr hangs on RHEL7

Open Yulmart opened this issue 4 years ago • 16 comments

When running rr record on RHEL7, rr seems to just hang.

Commit 29a59e412ad3e6600fcb37e9a5017f6eafce3b3d introduced this problem. It seems to be getting stuck in the waitid call here. Below is the backtrace from gdb right before it hangs.

(gdb) backtrace
#0  rr::Task::wait_exit (this=0xcd38a0) at /root/rr/src/Task.cc:161
#1  0x00000000008a3d0f in rr::Task::proceed_to_exit (this=0xcd38a0, wait=true) at /root/rr/src/Task.cc:184
#2  0x00000000007b6c9d in rr::handle_ptrace_exit_event (t=0xcd38a0) at /root/rr/src/RecordSession.cc:228
#3  0x00000000007bfbb8 in rr::RecordSession::record_step (this=0xcd1520) at /root/rr/src/RecordSession.cc:2321
#4  0x00000000007b3283 in rr::record (args=std::vector of length 1, capacity 2 = {...}, flags=...)
    at /root/rr/src/RecordCommand.cc:649
#5  0x00000000007b3d5b in rr::RecordCommand::run (this=0xccbeb0 <rr::RecordCommand::singleton>, 
    args=std::vector of length 1, capacity 2 = {...}) at /root/rr/src/RecordCommand.cc:792
#6  0x00000000008fa38b in main (argc=3, argv=0x7fffffffe108) at /root/rr/src/main.cc:268

Yulmart avatar Aug 20 '20 17:08 Yulmart

Does this happen on every program?

What kernel version?

khuey avatar Aug 20 '20 18:08 khuey

Does this happen on every program?

What kernel version?

$ uname -r
3.10.0-1127.el7.x86_64

I've tried rr on a simple C program and a few other command-line utilities, and it hangs in every case. The test suite seems to suffer from the same problem, as I'm seeing timeouts there.

Yulmart avatar Aug 20 '20 18:08 Yulmart

I’ve never gotten rr to build properly on RHEL7, passing all tests. Is there a different set of build instructions for that? Thanks, Jim


silkvine avatar Aug 20 '20 22:08 silkvine

I’ve never gotten rr to build properly on RHEL7, passing all tests. Is there a different set of build instructions for that? Thanks, Jim

Are you able to run rr to some extent on RHEL7? If so, what's the kernel version?

I'm building rr from source (along with this patch).

Yulmart avatar Aug 20 '20 22:08 Yulmart

Hey @Keno, do you have an idea as to what may be causing this issue?

Yulmart avatar Aug 24 '20 20:08 Yulmart

This stuff is pretty sensitive to the kernel getting wait notifications right, which definitely had problems at some point in the past. What the specific issue is though, I don't know.

Keno avatar Aug 24 '20 20:08 Keno

Possibly related to #2646.

Yulmart avatar Aug 24 '20 23:08 Yulmart

Probably not, if you see this on "a simple C program". #2646 requires PID namespaces.

khuey avatar Aug 24 '20 23:08 khuey

Probably not, if you see this on "a simple C program". #2646 requires PID namespaces.

Ah I see, thanks for clarifying.

Yulmart avatar Aug 25 '20 15:08 Yulmart

Additional information (strace results) that may help in diagnosing this issue.

RHEL7 results (right before it hangs):

...
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10161, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
waitid(P_PID, 10161, 0x7ffebc67a9e0, WSTOPPED|WNOWAIT, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

Fedora 32 results (no hanging):

...
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19639, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
waitid(P_PID, 19639, 0x7ffe044162f0, WSTOPPED|WNOWAIT, NULL) = -1 ECHILD (No child processes)
waitid(P_PID, 19639, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19639, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WNOHANG|WEXITED, NULL) = 0
...
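
For anyone who wants to check how their kernel handles the first waitid call from the traces above outside of rr, here is a minimal sketch in C. It is not rr's code: the function name probe_wstopped_on_exited_child and the alarm()-based timeout are assumptions made for illustration, and the child is a plain fork() child rather than a ptraced tracee, so it may well not reproduce the RHEL7 hang.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt a blocked waitid(). */
static void on_alarm(int sig) { (void)sig; }

/*
 * Hypothetical helper (not part of rr): issue
 * waitid(P_PID, pid, ..., WSTOPPED | WNOWAIT) on a child that has
 * already exited but has not been reaped yet.
 *
 * Returns 0 if that call succeeded, otherwise the errno it failed with.
 * EINTR means the call was still blocked when the timeout fired.
 */
int probe_wstopped_on_exited_child(unsigned timeout_secs) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;   /* no SA_RESTART, so waitid() gets EINTR */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return errno;

    pid_t pid = fork();
    if (pid < 0)
        return errno;
    if (pid == 0)
        _exit(0);               /* child: exit immediately */

    siginfo_t info;
    memset(&info, 0, sizeof info);

    /* Block until the child is a zombie, but leave it unreaped. */
    if (waitid(P_PID, pid, &info, WEXITED | WNOWAIT) != 0)
        return errno;

    /* The call under test, with the same flags as in the strace output. */
    alarm(timeout_secs);
    int rc = waitid(P_PID, pid, &info, WSTOPPED | WNOWAIT);
    int result = (rc == 0) ? 0 : errno;
    alarm(0);

    /* Reap the child, like the second waitid() in the Fedora trace. */
    waitid(P_PID, pid, &info, WEXITED | WNOHANG);
    return result;
}
```

Calling probe_wstopped_on_exited_child(5) from a small main and printing the return value is enough to use it. The Fedora 32 trace above suggests ECHILD is what rr expects here, while EINTR would mean the call blocked until the timer fired. ERESTARTSYS itself is only visible to strace; a program sees either a restarted call or EINTR.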

Yulmart avatar Aug 25 '20 15:08 Yulmart

I get the same hang on RHEL7 with rr built from the 5.3.0 source code. If I use the prebuilt rr-5.3.0-Linux-x86_64.tar.gz directly instead of building it myself, there is no hang.

vicshen avatar Sep 20 '20 08:09 vicshen

Same here with 3.10.0-1127.el7.x86_64 kernel from CentOS Linux release 7.8.2003 (Core) :-(

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24227, si_uid=12427, si_status=0, si_utime=2, si_stime=3} ---
waitid(P_PID, 24227,

and once got

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27927, si_uid=12427, si_status=0, si_utime=3, si_stime=5} ---
waitid(P_PID, 27927, ^C0x7ffe59af4cf0, WSTOPPED|WNOWAIT, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

Looks like a repo install:

/usr/bin/rr --version
rr version 5.4.0

Any hints on how to debug and/or fix that? It happens both with the system-provided GDB and with GDB 9.2 built from source.

GitMensch avatar Jul 16 '21 17:07 GitMensch

Realistically, we're not going to debug stuff on an 8-year-old kernel, and waitid hanging immediately after the SIGCHLD like that sure looks like a kernel bug.

khuey avatar Jul 16 '21 17:07 khuey

:-/

Besides RHEL8 (where the initial tests were running), RHEL7/CentOS7 is an LTS version and therefore quite common in corporate environments, so it would be really useful to be able to use rr there.

I think I'll try to get someone to build master on this machine or a VM copy on Monday and recheck, even though my hope that it will work has decreased a bit.

GitMensch avatar Jul 16 '21 19:07 GitMensch

I'm also experiencing this issue with RHEL7.

-bash-4.2$ rpm -q rr
rr-5.4.0-1.el7.x86_64
-bash-4.2$ uname -r
3.10.0-1160.90.1.el7.x86_64
-bash-4.2$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)

Is there a known working version for this kernel?

cheyngoodman avatar Aug 17 '23 18:08 cheyngoodman

That old version of the kernel is unsupported. That said, using the changes in the pull request referenced above fixed the issue for me.

GitMensch avatar Aug 17 '23 18:08 GitMensch