ast icon indicating copy to clipboard operation
ast copied to clipboard

Investigate signal test failures caused by removal of vmalloc

Open siteshwar opened this issue 6 years ago • 11 comments

signal test fails consistently after removal of vmalloc and has been kept disabled. We should investigate why it fails and reenable this test.

siteshwar avatar Jun 29 '18 13:06 siteshwar

read builtin in empty_fifos function gets blocked because there is a minor difference between time when read should timeout and the time sigalrm gets invoked.

This is related code in src/cmd/ksh93/sh/timers.c:

107                 if (tp->action) {
108                     if (tp->wakeup <= now) {
109                         if (!tpold || tpold->wakeup > tp->wakeup) tpold = tp;
110                     } else {

From my gdb debugging session:

(gdb) p tp->wakeup
$1 = 1530278491.5283189

(gdb) p now
$2 = 1530278491.5183508

(gdb) p tp->wakeup <= now
$3 = 0

And sigalrm never gets called again.

siteshwar avatar Jun 29 '18 13:06 siteshwar

In my PR #631 I noticed that the handling of SIGALRM is suspect. I didn't explicitly fix it but it is possible, even likely, that it fixes this specific failure mode. However, I enabled the signal test with that PR applied and it fails in new ways that are clearly suggestive of a use-after-free bug:

/tmp/ksh.signal.ypnl9dw/signal.sh: line 430: 37983: User signal 2
/tmp/ksh.signal.ypnl9dw/signal.sh: line 512: Èf: not found
/tmp/ksh.signal.ypnl9dw/signal.sh: line 512: ¢±^?: not found

That was on OpenSuse.

krader1961 avatar Jun 30 '18 03:06 krader1961

Also, on macOS I'm not seeing the above failures. I'm only seeing this one:

<E> signal[528]: expected si_queue or si_user, got |0|

Which is because macOS doesn't support sigqueue(). Patch forthcoming to predicate the test on a platform which supports sigqueue().

krader1961 avatar Jun 30 '18 03:06 krader1961

@krader1961 I suspect you are not hitting this race because you are on a slower system. I am able to reproduce it consistently even with changes from #631.

siteshwar avatar Jul 02 '18 20:07 siteshwar

@siteshwar The speed of my system is definitely not the issue. My primary workstation is a lightly loaded latest generation Mac Pro tower with 12GB of memory and a quad core 3.7 GHz Xeon CPU 😈. I have absolutely no difficulty reproducing the problem on my VMware hosted Linux distros. I can't reproduce it on macOS. Which is a very strong indication this is due to the behavior of the system malloc implementation. Furthermore, on Linux I see messages like this logged immediately after the first time function rttrap is run in response to receiving one of the signals from a kill -q:

/tmp/ksh.signal.sFbLdMs/signal.sh: line 515: èF^Fh^N^?: not found

Which is another strong indication of memory corruption -- most likely a use-after-free issue but possibly writing past the end of a dynamically allocated buffer.

The sole macOS failure I'm seeing is because macOS (and BSD systems more generally) don't populate the siginfo_t member si_code with the value of SI_USER. I've got a change I'll commit which deals with that difference in behavior compared to Linux/SysV as soon as PR #631 is merged.

krader1961 avatar Jul 03 '18 03:07 krader1961

Note that modifying the malloc API wrappers in src/lib/libast/misc/vmbusy.c to increase every allocation by 4 KiB does not fix the issue. So it does not appear to be a case of writing past the end of a malloc'd buffer. Modifying ast_free() to immediately return without freeing the buffer does eliminate the weird command "not found" errors but the test still times out. So it looks like there is more than one bug with one of them being a use-after-free bug.

Also, this is basically a duplicate of #525 where I deliberately disabled this test due to this failure mode. The hang when running the signal test on macOS is no longer happening but the timeouts do still occur. So I'll close that issue in favor of tracking the problem here.

krader1961 avatar Jul 03 '18 04:07 krader1961

I have made several observations while debugging this issue. I am going to summarize them here:

  1. Size of array that stores trap handlers is calculated based on value of SIGTERM. I think this should be increased to higher value. At least this block needs some review.

  2. The code to restore trap handlers seems to cause use after free issues. It is done on couple of places:

    • In src/cmd/ksh93/sh/xec.c this is the block of code that saves trap handlers and this block restores it. Commenting out free in the code that restore trap handlers fixes a use after free issue.

    • In src/cmd/ksh93/sh/subshell.c this code saves trap handlers and this code restores it.

  3. malloc is being called in a signal handler. I have opened a separate issue for it (See #647).

siteshwar avatar Jul 04 '18 15:07 siteshwar

Gah! That block of code in fault.c is awful and not just because it's opaque. There are far easier ways to determine the max signal number and it should be done by a build time feature test; not when ksh starts.

krader1961 avatar Jul 05 '18 02:07 krader1961

Also this line

shp->gd->sigmax = ++n;

is wrong if sigmax is supposed to be the maximum signal number. But looking at how that var is used it appears, despite the name, that it is supposed to be the max signum + 1. I would name it something like nsigs or num_signals. Which is admittedly itself somewhat problematic since there is no guarantee that signal numbers are contiguous.

krader1961 avatar Jul 05 '18 03:07 krader1961

@krader1961 The bug that I mentioned in second point of my previous comment is still reproducible, so I think we should reopen this.

siteshwar avatar Jul 05 '18 20:07 siteshwar

I think we should reopen this.

Yes, that commit was meant to just reference this issue not close it.

krader1961 avatar Jul 05 '18 21:07 krader1961