Investigate signal test failures caused by removal of vmalloc
The `signal` test fails consistently after the removal of vmalloc and has been kept disabled. We should investigate why it fails and re-enable this test.
The `read` builtin in the `empty_fifos` function gets blocked because there is a minor difference between the time when `read` should time out and the time when `SIGALRM` gets invoked.
This is the related code in src/cmd/ksh93/sh/timers.c:

```c
107    if (tp->action) {
108        if (tp->wakeup <= now) {
109            if (!tpold || tpold->wakeup > tp->wakeup) tpold = tp;
110        } else {
```
From my gdb debugging session:

```
(gdb) p tp->wakeup
$1 = 1530278491.5283189
(gdb) p now
$2 = 1530278491.5183508
(gdb) p tp->wakeup <= now
$3 = 0
```
And `SIGALRM` never gets called again.
In my PR #631 I noticed that the handling of SIGALRM is suspect. I didn't explicitly fix it, but it is possible, even likely, that it fixes this specific failure mode. However, I enabled the `signal` test with that PR applied and it fails in new ways that are clearly suggestive of a use-after-free bug:
```
/tmp/ksh.signal.ypnl9dw/signal.sh: line 430: 37983: User signal 2
/tmp/ksh.signal.ypnl9dw/signal.sh: line 512: Èf: not found
/tmp/ksh.signal.ypnl9dw/signal.sh: line 512: ¢±^?: not found
```
That was on openSUSE.
Also, on macOS I'm not seeing the above failures. I'm only seeing this one:
```
<E> signal[528]: expected si_queue or si_user, got |0|
```
That is because macOS doesn't support `sigqueue()`. A patch is forthcoming to predicate the test on platforms that support `sigqueue()`.
@krader1961 I suspect you are not hitting this race because you are on a slower system. I am able to reproduce it consistently even with changes from #631.
@siteshwar The speed of my system is definitely not the issue. My primary workstation is a lightly loaded latest-generation Mac Pro tower with 12GB of memory and a quad-core 3.7 GHz Xeon CPU 😈. I have absolutely no difficulty reproducing the problem on my VMware-hosted Linux distros. I can't reproduce it on macOS, which is a very strong indication this is due to the behavior of the system malloc implementation. Furthermore, on Linux I see messages like this logged immediately after the first time the function `rttrap` is run in response to receiving one of the signals from a `kill -q`:

```
/tmp/ksh.signal.sFbLdMs/signal.sh: line 515: èF^Fh^N^?: not found
```
Which is another strong indication of memory corruption -- most likely a use-after-free issue but possibly writing past the end of a dynamically allocated buffer.
The sole macOS failure I'm seeing is because macOS (and BSD systems more generally) don't populate the `siginfo_t` member `si_code` with the value of `SI_USER`. I've got a change I'll commit which deals with that difference in behavior compared to Linux/SysV as soon as PR #631 is merged.
Note that modifying the malloc API wrappers in src/lib/libast/misc/vmbusy.c to increase every allocation by 4 KiB does not fix the issue, so it does not appear to be a case of writing past the end of a malloc'd buffer. Modifying `ast_free()` to immediately return without freeing the buffer does eliminate the weird command "not found" errors, but the test still times out. So it looks like there is more than one bug, with one of them being a use-after-free bug.
Also, this is basically a duplicate of #525, where I deliberately disabled this test due to this failure mode. The hang when running the `signal` test on macOS is no longer happening, but the timeouts do still occur. So I'll close that issue in favor of tracking the problem here.
I have made several observations while debugging this issue. I am going to summarize them here:
- The size of the array that stores trap handlers is calculated based on the value of `SIGTERM`. I think this should be increased to a higher value. At least this block needs some review.
- The code to restore trap handlers seems to cause use-after-free issues. It is done in a couple of places:
- `malloc` is being called in a signal handler. I have opened a separate issue for it (see #647).
Gah! That block of code in fault.c is awful, and not just because it's opaque. There are far easier ways to determine the max signal number, and it should be done by a build-time feature test, not when ksh starts.
Also this line

```c
shp->gd->sigmax = ++n;
```

is wrong if `sigmax` is supposed to be the maximum signal number. But looking at how that var is used, it appears, despite the name, that it is supposed to be the max signum + 1. I would name it something like `nsigs` or `num_signals`. Which is admittedly itself somewhat problematic, since there is no guarantee that signal numbers are contiguous.
@krader1961 The bug that I mentioned in the second point of my previous comment is still reproducible, so I think we should reopen this.
> I think we should reopen this.
Yes, that commit was meant to just reference this issue not close it.