[BUG] Resource leak in SMP mode when running signal handler
Description / Steps to reproduce the issue
There is a serious issue with the current asynchronous signal delivery system: it forcibly makes another CPU resume execution at points where this must not happen. Take the following example, where a task on CPU0 takes a semaphore and CPU1 sends a signal to that task:
```
CPU0                                        CPU1

nxsem_wait() // Take semaphore
enter_critical_section()
... in atomic section ...
up_switch_context(this_task(), rtcb)
---> next process, atomic section over      enter_critical_section()
                                            nxsig_queue_action()
                                            nxsched_smp_call_single()
                                            // Setup interrupt on CPU0 to run sig_handler
                                            nxsched_smp_call_single(stcb->cpu, sig_handler, &arg, true);
<--- SMP_CALL interrupt pends on CPU0
                                            leave_critical_section()
---> SMP_CALL interrupt fires on CPU0
 |
 v
nxsched_smp_call_handler()
// Run sig_handler on CPU0
sig_handler()
up_schedule_sigaction()
// up_schedule_sigaction makes the task signaled from CPU1 return to riscv_sigdeliver
tcb->xcp.regs[REG_EPC] = (uintptr_t)riscv_sigdeliver;
 |
 v
riscv_smp_call_handler()
// riscv_smp_call_handler restores the (new) context, EPC=riscv_sigdeliver
tcb = current_task(cpu);
riscv_savecontext(tcb);
nxsched_process_delivered(cpu);
tcb = current_task(cpu);
riscv_restorecontext(tcb);
 |
 v
riscv_smp_call_handler() interrupt returns
 |
 v
riscv_sigdeliver()
 |
 v
signal_handler()
// Signal handler runs in userspace
***CRASH***
```
If the process crashes in the signal handler, the semaphore taken on CPU0 is never released, causing a resource leak.
Such a leak is survivable for user resources but is catastrophic for kernel resources!
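To make the leak concrete, here is a minimal hypothetical sketch of the pattern the trace above breaks (the names `kernel_worker` and `do_protected_work` are illustrative, not from the tree):

```c
#include <nuttx/mutex.h>

static mutex_t g_lock = NXMUTEX_INITIALIZER;

static void do_protected_work(void)
{
  /* ... work that relies on g_lock being held ... */
}

static void kernel_worker(void)
{
  nxmutex_lock(&g_lock);    /* Kernel resource acquired                 */

  do_protected_work();      /* <-- SMP_CALL interrupt can fire here and */
                            /*     divert execution to riscv_sigdeliver */

  nxmutex_unlock(&g_lock);  /* Never reached if the diverted task       */
                            /* crashes or blocks in its signal handler  */
}
```

If the signal handler itself needs g_lock instead of crashing, the task blocks on its own lock, which is exactly the deadlock variant described below.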
On which OS does this issue occur?
[OS: Linux]
What is the version of your OS?
Irrelevant
NuttX Version
master
Issue Architecture
[Arch: all]
Issue Area
[Area: Posix]
Verification
- [X] I have verified before submitting the report.
I noticed the issue with syslog(), though in that case it causes a deadlock rather than a crash. The problem is still the same; the leaked resource is g_lowputs_lock.
```c
static ssize_t syslog_default_write(FAR syslog_channel_t *channel,
                                    FAR const char *buffer, size_t buflen)
{
#  ifdef CONFIG_ARCH_LOWPUTC
  nxmutex_lock(&g_lowputs_lock);
  up_nputs(buffer, buflen);
  nxmutex_unlock(&g_lowputs_lock);
#  endif

  UNUSED(channel);
  return buflen;
}
#endif
```
All you need is for CPU0 to call syslog_default_write(), taking the g_lowputs_lock mutex, while CPU1 sends a signal to the task running on CPU0. When CPU0 starts running the signal dispatch logic, it will deadlock if there is another syslog() call in that path. One example below:
https://github.com/apache/nuttx/blob/b4a6d456c99d3c3dac843f4a18f13c78afe9395a/arch/risc-v/src/common/riscv_sigdeliver.c#L55-L72
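For completeness, here is a hedged user-space reproduction sketch (my own construction, not from ostest): one thread spams syslog() so that g_lowputs_lock is frequently held, while the main thread keeps signaling it with a handler that also calls syslog(). If the signal lands while the lock is held, the handler blocks on the same lock and that CPU deadlocks. Whether the two threads actually end up on different CPUs is up to the scheduler, so it may take a few iterations.

```c
#include <pthread.h>
#include <signal.h>
#include <syslog.h>
#include <unistd.h>

static void handler(int signo)
{
  /* Takes g_lowputs_lock again; blocks forever if the interrupted
   * thread already holds it.
   */

  syslog(LOG_INFO, "in signal handler\n");
}

static void *spammer(void *arg)
{
  for (; ; )
    {
      syslog(LOG_INFO, "holding g_lowputs_lock briefly\n");
    }

  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t tid;

  signal(SIGUSR1, handler);
  pthread_create(&tid, NULL, spammer, NULL);

  for (; ; )
    {
      pthread_kill(tid, SIGUSR1); /* Cross-CPU delivery goes through
                                   * nxsched_smp_call_single() */
      usleep(1000);
    }

  return 0;
}
```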
This is how it looks on the terminal:

```
[CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=1 code=0 value=0 masked=NO, svcall=YES
[CPU0] [CPU1] nxsig_tcbdispatch: TCB=0x80410ad0 pid=58 signo=2 code=0 value=0 masked=NO, svcall=YES
[CPU0] up_schedule_sigaction: tcb=0x80410ad0, rtcb=0x80410ad0 current_regs=0x80413720
[CPU0]
```
Steps to reproduce: enable

```
CONFIG_DEBUG_FEATURES=y
CONFIG_DEBUG_SCHED_INFO=y
```

and configure with

```
tools/configure.sh rv-virt:ksmp64
```

Start QEMU and run ostest. It will get stuck in one of the signal handler tests almost every time.
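For reference, the rv-virt SMP configurations are typically started with a QEMU invocation along these lines (based on the usual pattern in the board documentation; treat the exact flags, notably the SMP count, as an assumption for ksmp64):

```
qemu-system-riscv64 -semihosting -M virt,aclint=on -cpu rv64 -smp 8 \
  -bios none -kernel nuttx -nographic
```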