seastar
stall-detector: Try hard not to crash while collecting backtrace
Sometimes the stall-detector signal arrives in the middle of exception handling. When a stall is detected, the detector starts unwinding the stack to collect the backtrace of the stalled code. Exception handling unwinds the stack as well, so the two unwinders need to cooperate carefully, which is not guaranteed (spoiler: they don't). In the unlucky case a segmentation fault happens and the app is killed with SIGSEGV.
This patch lets the stall detector bail out if SIGSEGV arrives while it is collecting the backtrace, and still print a minimal yet sufficiently detailed stall report.
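For readers unfamiliar with the technique, here is a minimal sketch of one common way to bail out of a faulting section: save a jump point with `sigsetjmp`, and have the SIGSEGV handler `siglongjmp` back to it while collection is in progress. This is not necessarily how the patch does it. All names (`sketch`, `on_segv`, `install_segv_guard`, `guarded_collect`) are hypothetical and are not Seastar APIs. The sketch also runs the collector from normal context, whereas the real stall detector collects from inside its own signal handler.

```cpp
#include <setjmp.h>
#include <signal.h>

// Illustrative only: hypothetical names, not Seastar's implementation.
namespace sketch {

static sigjmp_buf bailout_env;                  // where to resume after a fault
static volatile sig_atomic_t collecting = 0;    // set only while collecting

extern "C" void on_segv(int signo) {
    if (collecting) {
        // The fault happened inside the guarded section: abandon it.
        collecting = 0;
        siglongjmp(bailout_env, 1);
    }
    // Unrelated fault: restore the default action and re-raise, so the
    // process still dies with SIGSEGV once this handler returns.
    signal(signo, SIG_DFL);
    raise(signo);
}

void install_segv_guard() {
    struct sigaction sa = {};
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, nullptr);
}

// Runs `collect` and returns its result (e.g. the number of frames
// gathered), or -1 if SIGSEGV arrived while it was running.
// `collect` must not own objects with non-trivial destructors, because
// siglongjmp skips them.
int guarded_collect(int (*collect)()) {
    // Second argument 1: also save the signal mask, so SIGSEGV is
    // unblocked again after jumping out of the handler.
    if (sigsetjmp(bailout_env, 1) != 0) {
        return -1;   // we got here via siglongjmp from on_segv
    }
    collecting = 1;
    int frames = collect();
    collecting = 0;
    return frames;
}

} // namespace sketch
```

A caller would run `sketch::install_segv_guard()` once and then wrap the unwinding step in `sketch::guarded_collect(...)`. On a `-1` result it can print whatever partial report it already has instead of crashing. As with the patch, this only covers the SIGSEGV symptom; an unwinder stuck in an infinite loop would not be caught.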
This doesn't solve the problem entirely, since SIGSEGV isn't the only possible symptom (you could get an infinite loop, for example), but I guess it prevents a crash in the cases where it's enough (probably the great majority), and doesn't hurt in the others, so why not.
It's worth noting there is a reproducer now for at least one type of crash, see https://github.com/scylladb/seastar/issues/2697 and https://github.com/scylladb/seastar/pull/2714. So perhaps it is worth revisiting this PR, since part of the reason it stalled seemed to be the lack of a repro?
upd: rebased to check #2714
closing in favor of #2714