add a reference count to the TCB to prevent it from being deleted.
Summary
Add a reference count to the TCB to prevent it from being deleted.
To replace the big lock with smaller ones and reduce big-lock usage around the TCB: in many scenarios we only need to guarantee that the TCB will not be released, rather than taking a lock, which also reduces the possibility of lock recursion.
Should be merged together with https://github.com/apache/nuttx-apps/pull/3246.
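To make the intended usage concrete, here is a minimal sketch of how such a get/put pair could work. Only the nxsched_get_tcb()/nxsched_put_tcb() names come from this PR; the crefs field, the exited marker, and nxsched_lookup_tcb() are hypothetical names used purely for illustration, and the real patch may differ:

FAR struct tcb_s *nxsched_get_tcb(pid_t pid)
{
  irqstate_t flags = enter_critical_section();
  FAR struct tcb_s *tcb = nxsched_lookup_tcb(pid); /* hypothetical lookup */

  if (tcb != NULL)
    {
      tcb->crefs++;  /* Pin the TCB so it cannot be freed while in use */
    }

  leave_critical_section(flags);
  return tcb;
}

void nxsched_put_tcb(FAR struct tcb_s *tcb)
{
  irqstate_t flags = enter_critical_section();

  if (--tcb->crefs == 0 && tcb->exited)  /* hypothetical exit marker */
    {
      /* The thread exited while the TCB was pinned: do the deferred
       * free now.
       */

      nxsched_release_tcb(tcb, tcb->flags & TCB_FLAG_TTYPE_MASK);
    }

  leave_critical_section(flags);
}

With a pair like this, callers keep the TCB pinned only for the duration of their access instead of holding a lock across it, which is what the summary means by ensuring the TCB won't be released instead of locking.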
Impact
TCB release: freeing a TCB is now deferred while it is still referenced.
Testing
Tested on hardware:
esp32s3-devkit:nsh
user_main: scheduler lock test
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Finished
End of test memory usage:
VARIABLE  BEFORE   AFTER
======== ======== ========
arena       5d8bc    5d8bc
ordblks         7        6
mxordblk    548a0    548a0
uordblks     5014     5014
fordblks    588a8    588a8
Final memory usage:
VARIABLE  BEFORE   AFTER
======== ======== ========
arena       5d8bc    5d8bc
ordblks         1        6
mxordblk    59238    548a0
uordblks     4684     5014
fordblks    59238    588a8
user_main: Exiting
ostest_main: Exiting with status 0
nsh> uname -a
NuttX 12.11.0 ef91333e3ac-dirty Dec 10 2025 16:11:04 xtensa esp32s3-devkit
nsh>
There are some board-level changes in this PR; please separate those into another commit.
@hujun260 I think modifications that touch the scheduler need to be tested on more hardware (and not only 32-bit, since we support 8- to 64-bit MCUs/MPUs). I know it is difficult to test on many boards, but at least AVR, STM32, ESP32 and BCM2711 should be tested.
I really fail to see the rationale behind trying to turn NuttX into Linux. The same issue cropped up when you modified spin_lock before—if we are going to align everything with Linux, then why not just use a tailored Linux distribution instead?
Good point! I think the NuttX focus should be MCUs with real-time requirements (something Linux cannot offer). On the other hand, many companies want NuttX to be a replacement for Linux itself. I think the two goals need not conflict, but we cannot lose the main focus just to be more Linux-like.
@hujun260 @anchao @xiaoxiang781216 how can we resolve this conflict in a way that meets both goals?
Nobody wants to change NuttX into Linux, but the big critical section needs to be addressed since:
- More and more SoCs ship with multiple cores, but SMP performance is very poor with one big critical section
- One big critical section also makes interrupt latency larger than expected
So we need to strike a balance here.
Ok, I think we need to understand the impact it will cause on non-SMP MCUs. Could you please provide some test measurements before and after this modification?
I ran osperf on a single-core ESP32 for performance evaluation, and the key metrics showed little change after applying the patch.
@anchao seems like the impact is not as negative as you thought. I think it is fine! What do you think?
@acassis This result is SMP, not AMP.
Also, you can see that this commit contains many changes to nxsched_get_tcb/nxsched_put_tcb, and these new lines of code definitely impact performance.
- Furthermore, in operating systems with a fixed, predefined set of tasks, task reference counting is useless and only increases the system load.
- I wonder if all of you are aware that NuttX offers no competitive edge over Zephyr and FreeRTOS in performance metrics such as thread creation, context switching, interrupt response, and semaphore operations. Given this fact, why should we still impose such overhead on NuttX users?
Which one delivers better performance is self-evident.
Is it possible to make this PR optional? Perhaps it should affect only the bigger SMP-capable chips and not the smaller single-core ones?
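If it were made configurable, the cost could be compiled out entirely on small targets. A hypothetical sketch, where CONFIG_SCHED_TCB_REFCOUNT and nxsched_put_tcb_impl() are invented names used only to illustrate the shape such an option could take:

#ifdef CONFIG_SCHED_TCB_REFCOUNT
/* Bigger SMP-capable chips: pin/unpin TCBs through a real reference count */
#  define nxsched_put_tcb(tcb)  nxsched_put_tcb_impl(tcb)
#else
/* Smaller single-core chips: the unpin compiles away to a no-op */
#  define nxsched_put_tcb(tcb)  ((void)(tcb))
#endif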
@anchao @acassis
Patch Motivation & Proper Solution:
The main purpose of this patch is to address use-after-free issues that can occur when a thread unexpectedly exits. For example, in proc_open(), nxsched_get_tcb() obtains the TCB, but the corresponding thread may then exit at a higher priority; a similar risk exists during calls to functions like nxclock_gettime(), sched_backtrace(), or mm_memdump(). These problems should be systematically addressed by introducing critical sections or by employing other synchronization strategies in those critical paths.
Not Just SMP: These use-after-free problems can occur in both SMP and UP environments — this is not an SMP-only issue.
Limitations of Critical Sections: Even with the addition of critical sections, these issues may not be fully resolved. For instance, consider the following example:
mm_memdump()
{
  enter_critical_section();
  tcb = nxsched_get_tcb(pid);
  syslog();                    /* might wake up or schedule another thread */
  name = get_task_name(tcb);   /* the TCB may already have been freed here */
  leave_critical_section();
}
In this scenario, even with enter_critical_section(), there can still be a race if syslog() internally leads to context switches that affect the TCB’s lifetime.
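By contrast, here is a hedged sketch of the same path using the reference-counted pair from this PR, assuming nxsched_put_tcb() both releases the reference and performs any deferred free (the function body here is an illustrative variant, not the real mm_memdump()):

void mm_memdump_example(pid_t pid)
{
  FAR const char *name;
  FAR struct tcb_s *tcb = nxsched_get_tcb(pid);  /* pins the TCB */

  if (tcb == NULL)
    {
      return;
    }

  syslog(LOG_INFO, "dumping pid %d\n", (int)pid);  /* may context-switch;
                                                    * the pinned TCB stays
                                                    * valid */
  name = get_task_name(tcb);  /* safe even if the thread exited meanwhile */
  nxsched_put_tcb(tcb);       /* unpin; freed here if the thread already
                               * exited */
}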
On Mainstream RTOS Practice: The notion that "mainstream RTOS do not incorporate this feature, as the number of threads is predetermined in most scenarios" is not accurate. In a typical IoT system, creating and destroying threads (pthreads) is frequent and widespread—thread counts are rarely static. While in safety-critical automotive contexts thread counts might be fixed, such scenarios are the exception, not the rule, for consumer and general-purpose products.
On Performance and Size Impact: Statements such as "it will exert an extremely detrimental impact on both size and performance" are not substantiated without concrete measurement or testing. According to the data provided by @hujun260, the actual performance impact is minimal; please carefully review the test report. Regarding code size: yes, adding additional safety checks will increase flash usage, and we will supplement the PR with precise flash size measurements.
If there is a better alternative to solve this problem, then this patch is unnecessary. Patches are welcome.
@GUIDINGLI thank you for bringing more information to this discussion. I think this commit will be beneficial to NuttX, but I think the data that @hujun260 brought is not helping much: the osperf results are inconclusive, see:
- why did the context-switch time decrease 3.36×?
- why did hpwork decrease as well?
- why did pipe-rw show a 35% overhead?
- why did semwait become almost 2× faster?
Seems like osperf is not reliable for this kind of test; maybe we need to use Segger J-Link RTT or ORBTrace/Orbuculum to get more precise/reliable data.
@GUIDINGLI @hujun260 This explanation is extremely helpful for giving context and understanding this change. I recommend adding this explanation to the PR description and also to the commit log, so that it can be easily understood in the future. Thank you!
@acassis @hartmannathan Thanks for the feedback. We will re-test the performance data, not only with osperf but also with rtos-benchmark, and will also provide the flash size data.
You can try running this code. The nxsched_get_tcb() function incurs a 2.5-4% performance overhead, and this function is used throughout the entire NuttX code base:
#include <nuttx/config.h>

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <pthread.h>

/****************************************************************************
 * Public Functions
 ****************************************************************************/

/****************************************************************************
 * hello_main
 ****************************************************************************/

pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

static void timespec_diff(const struct timespec *start,
                          const struct timespec *end,
                          struct timespec *diff)
{
  diff->tv_sec = end->tv_sec - start->tv_sec;
  diff->tv_nsec = end->tv_nsec - start->tv_nsec;
  if (diff->tv_nsec < 0)
    {
      diff->tv_sec--;
      diff->tv_nsec += 1000000000;
    }
}

static void timespec_add(struct timespec *total, const struct timespec *diff)
{
  total->tv_sec += diff->tv_sec;
  total->tv_nsec += diff->tv_nsec;
  if (total->tv_nsec >= 1000000000)
    {
      total->tv_sec += total->tv_nsec / 1000000000;
      total->tv_nsec = total->tv_nsec % 1000000000;
    }
}

static void timespec_avg(const struct timespec *total, int count,
                         struct timespec *avg)
{
  uint64_t total_ns = (uint64_t)total->tv_sec * 1000000000 + total->tv_nsec;
  uint64_t avg_ns = total_ns / count;
  avg->tv_sec = avg_ns / 1000000000;
  avg->tv_nsec = avg_ns % 1000000000;
}

int main(int argc, char *argv[])
{
  struct timespec start;
  struct timespec end;
  struct timespec diff;
  struct timespec total = {0, 0};
  struct timespec avg;
  int i, j = 0;
  const int loop_count = 10;

  while (j < loop_count)
    {
      i = 0;
      j++;
      clock_gettime(CLOCK_BOOTTIME, &start);
      pthread_mutex_lock(&g_mutex);
      while (i < 1000 * 1000)
        {
          i++;

          /* trylock on the already-held mutex fails immediately (EBUSY),
           * but still exercises the lock path under test on every pass.
           */

          pthread_mutex_trylock(&g_mutex);
        }

      pthread_mutex_unlock(&g_mutex);
      clock_gettime(CLOCK_BOOTTIME, &end);
      timespec_diff(&start, &end, &diff);
      timespec_add(&total, &diff);
      timespec_avg(&total, j, &avg);
      printf("%d: diff = %lu.%09lu s | avg = %lu.%09lu s\n",
             j,
             (unsigned long)diff.tv_sec, (unsigned long)diff.tv_nsec,
             (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
    }

  printf("\n===== result =====\n");
  printf("count: %d\n", loop_count);
  printf("total: %lu.%09lu s\n",
         (unsigned long)total.tv_sec, (unsigned long)total.tv_nsec);
  printf("avg: %lu.%09lu s\n",
         (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
  return 0;
}
Additionally, I conducted the validation on the sim host PC; running this test on an MCU may result in even greater performance degradation:
Before this change:
nsh> hello
1: diff = 0.306540108 s | avg = 0.306540108 s
2: diff = 0.297250552 s | avg = 0.301895330 s
3: diff = 0.296764389 s | avg = 0.300185016 s
4: diff = 0.297345113 s | avg = 0.299475040 s
5: diff = 0.296678248 s | avg = 0.298915682 s
6: diff = 0.296436913 s | avg = 0.298502553 s
7: diff = 0.296414429 s | avg = 0.298204250 s
8: diff = 0.296148430 s | avg = 0.297947272 s
9: diff = 0.296353964 s | avg = 0.297770238 s
10: diff = 0.296554913 s | avg = 0.297648705 s
===== result =====
count: 10
total: 2.976487059 s
avg: 0.297648705 s
After this change:
nsh> hello
1: diff = 0.322894500 s | avg = 0.322894500 s
2: diff = 0.304714785 s | avg = 0.313804642 s
3: diff = 0.303939787 s | avg = 0.310516357 s
4: diff = 0.304068904 s | avg = 0.308904494 s
5: diff = 0.304900920 s | avg = 0.308103779 s
6: diff = 0.304780429 s | avg = 0.307549887 s
7: diff = 0.304122773 s | avg = 0.307060299 s
8: diff = 0.304266724 s | avg = 0.306711102 s
9: diff = 0.304566429 s | avg = 0.306472805 s
10: diff = 0.303873630 s | avg = 0.306212888 s
===== result =====
count: 10
total: 3.062128881 s
avg: 0.306212888 s
Nice work @anchao !!!
Based on this test alone, I can see the overhead was about 2.9% => ((0.306212888 / 0.297648705) - 1) * 100, comparing the final averages.
So, we need to decide whether it makes sense to accept this overhead in exchange for the benefit of this PR. Alternatively, as @hartmannathan pointed out, this feature could be made configurable (although I think NuttX already has enough config switches; we should remove some instead of adding more).
@anchao Thank you for providing some measurement data. As I mentioned before, I did not claim that this patch has no performance overhead—any robustness improvement may introduce some performance regression. We are also actively measuring the data and looking for opportunities to minimize any performance impact.
I think we are aligned on the same objective. NuttX is a real-time operating system, and our modifications should be more deterministic and faster, and should not impose any additional burden on developers familiar with NuttX.
