[BUG] Resource leak in SMP mode when running signal handler
Description / Steps to reproduce the issue
There is a serious issue with the current asynchronous signal delivery system: it forcibly makes another CPU resume execution at points where this must not happen. Take the following example, where a task on CPU0 takes a semaphore and CPU1 sends a signal to that task:
```
CPU0                                        CPU1

nxsem_wait() // Take semaphore
enter_critical_section()
... in atomic section ...
up_switch_context(this_task(), rtcb)
---> next process, atomic section over      enter_critical_section()
                                            nxsig_queue_action()
                                            nxsched_smp_call_single()
                                            // Setup interrupt on CPU0 to run sig_handler
                                            nxsched_smp_call_single(stcb->cpu, sig_handler, &arg, true);
<--- SMP_CALL interrupt pends on CPU0
                                            leave_critical_section()
---> SMP_CALL interrupt fires on CPU0
 |
 v
nxsched_smp_call_handler()
// Run sig_handler on CPU0
sig_handler()
up_schedule_sigaction()
// up_schedule_sigaction makes the task signaled from CPU1 return to riscv_sigdeliver
tcb->xcp.regs[REG_EPC] = (uintptr_t)riscv_sigdeliver;
 |
 v
riscv_smp_call_handler()
// riscv_smp_call_handler restores the (new) context, EPC=riscv_sigdeliver
tcb = current_task(cpu);
riscv_savecontext(tcb);
nxsched_process_delivered(cpu);
tcb = current_task(cpu);
riscv_restorecontext(tcb);
 |
 v
riscv_smp_call_handler() interrupt returns
 |
 v
riscv_sigdeliver()
 |
 v
signal_handler()
// Signal handler runs in userspace
***CRASH***
```
If the process crashes in the signal handler, the semaphore taken on CPU0 is never released, causing a resource leak.
Such a leak is survivable for user resources but is catastrophic for kernel resources!
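To make the leak concrete, here is a minimal hypothetical sketch of the pattern the trace above breaks (the names `kernel_worker` and `do_protected_work` are illustrative, not from the tree):

```c
#include <nuttx/mutex.h>

static mutex_t g_lock = NXMUTEX_INITIALIZER;

static void do_protected_work(void)
{
  /* ... work that relies on g_lock being held ... */
}

static void kernel_worker(void)
{
  nxmutex_lock(&g_lock);    /* Kernel resource acquired                 */

  do_protected_work();      /* <-- SMP_CALL interrupt can fire here and */
                            /*     divert execution to riscv_sigdeliver */

  nxmutex_unlock(&g_lock);  /* Never reached if the diverted task       */
                            /* crashes or blocks in its signal handler  */
}
```

If the signal handler itself needs g_lock instead of crashing, the task blocks on its own lock, which is exactly the deadlock variant described below.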
On which OS does this issue occur?
[OS: Linux]
What is the version of your OS?
Irrelevant
NuttX Version
master
Issue Architecture
[Arch: all]
Issue Area
[Area: Posix]
Verification
- [X] I have verified before submitting the report.
I noticed the issue with syslog(), though in that case it causes a deadlock rather than a crash. The problem is still the same; the leaked resource is g_lowputs_lock.
```c
static ssize_t syslog_default_write(FAR syslog_channel_t *channel,
                                    FAR const char *buffer, size_t buflen)
{
#  ifdef CONFIG_ARCH_LOWPUTC
  nxmutex_lock(&g_lowputs_lock);
  up_nputs(buffer, buflen);
  nxmutex_unlock(&g_lowputs_lock);
#  endif

  UNUSED(channel);
  return buflen;
}
#endif
```
All you need is for CPU0 to call syslog_default_write(), taking the g_lowputs_lock mutex, while CPU1 sends a signal to the task running on CPU0. When CPU0 starts running the signal dispatch logic, it will deadlock if there is another syslog() call in that path. One example below:
https://github.com/apache/nuttx/blob/b4a6d456c99d3c3dac843f4a18f13c78afe9395a/arch/risc-v/src/common/riscv_sigdeliver.c#L55-L72
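For completeness, here is a hedged user-space reproduction sketch (my own construction, not from ostest): one thread spams syslog() so that g_lowputs_lock is frequently held, while the main thread keeps signaling it with a handler that also calls syslog(). If the signal lands while the lock is held, the handler blocks on the same lock and that CPU deadlocks. Whether the two threads actually end up on different CPUs is up to the scheduler, so it may take a few iterations.

```c
#include <pthread.h>
#include <signal.h>
#include <syslog.h>
#include <unistd.h>

static void handler(int signo)
{
  /* Takes g_lowputs_lock again; blocks forever if the interrupted
   * thread already holds it.
   */

  syslog(LOG_INFO, "in signal handler\n");
}

static void *spammer(void *arg)
{
  for (; ; )
    {
      syslog(LOG_INFO, "holding g_lowputs_lock briefly\n");
    }

  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t tid;

  signal(SIGUSR1, handler);
  pthread_create(&tid, NULL, spammer, NULL);

  for (; ; )
    {
      pthread_kill(tid, SIGUSR1); /* Cross-CPU delivery goes through
                                   * nxsched_smp_call_single() */
      usleep(1000);
    }

  return 0;
}
```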
This is how it looks on the terminal:

```
[CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=1 code=0 value=0 masked=NO, svcall=YES
[CPU0] [CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=2 code=0 value=0 masked=NO, svcall=YES
[CPU0] up_schedule_sigaction: tcb=0x80410ad0, rtcb=0x80410ad0 current_regs=0x80413720
[CPU0]
```
Steps to reproduce: enable

```
CONFIG_DEBUG_FEATURES=y
CONFIG_DEBUG_SCHED_INFO=y
```

and configure with

```
tools/configure.sh rv-virt:ksmp64
```

Start QEMU and run ostest. It will get stuck in one of the signal handler tests almost every time.
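For reference, the rv-virt SMP configurations are typically started with a QEMU invocation along these lines (based on the usual pattern in the board documentation; treat the exact flags, notably the SMP count, as an assumption for ksmp64):

```
qemu-system-riscv64 -semihosting -M virt,aclint=on -cpu rv64 -smp 8 \
  -bios none -kernel nuttx -nographic
```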