nuttx icon indicating copy to clipboard operation
nuttx copied to clipboard

Revert "fs/inode: add pre-allocated task files to avoid allocator access

Open Donny9 opened this issue 1 year ago • 21 comments

Summary

Revert "fs/inode: add pre-allocated task files to avoid allocator access

This reverts commit 086834aae1722af2797fd5d4d0f79226ac11c2e9.

The refs in the filelist cannot protect the handles in the prefiles The prefiles and its associated group are allocated together, and when the group leaves, they are released together. During a reboot process, the sync operation may encounter a "used after free" issue.

issue stack:

(filep=filep@entry=0x3c482770) at ../../../fs/vfs/fs_close.c:80
../../../fs/vfs/fs_close.c:118
../../../fs/inode/fs_files.c:476
/home/cibuild/Public/jenkinsversion/2148/nuttx/include/nuttx/fs/fs.h:870
at ../../../fs/inode/fs_files.c:223
<task_fssync(tcb_s*, void*)>, arg=0x0) at

../../../sched/sched/sched_foreach.c:73 data=) at ../../../fs/fs_initialize.c:48 ../../../sched/misc/reboot_notifier.c:87 ../../../boards/boardctl.c:415 argv=0x3d3e07e8) at ../../../../apps/nshlib/nsh_syscmds.c:465 argc=argc@entry=2, argv=0x3d3e07e8, argv@entry=0x3d3e0820) at ../../../../apps/nshlib/nsh_command.c:1247 argv=0x3d3e0820, argc=2, vtbl=0x3c5d2aa8) at ../../../../apps/nshlib/nsh_parse.c:847 cmdline=cmdline@entry=0x3d3df93a "reboot") at ../../../../apps/nshlib/nsh_parse.c:2757 "reboot") at ../../../../apps/nshlib/nsh_parse.c:2844 argc=argc@entry=3, argv=0x3d3df920) at ../../../../apps/nshlib/nsh_session.c:146 char**)>, isctty=isctty@entry=0) at ../../../../apps/nshlib/nsh_system.c:47 out>) at ../../../../apps/nshlib/nsh_system.c:80 entrypt=) at ../../../libs/libc/sched/task_startup.c:70

And the original prefile improvement was only 3 microseconds, which is very minimal, so reverting it is the best approach.

Impact

Bug fix

Testing

local test

Donny9 avatar Oct 20 '24 15:10 Donny9

[Experimental Bot, please feedback here]

The PR summary does not fully meet NuttX requirements. Here's a breakdown and suggestions for improvement:

Strengths:

  • Clear problem description: The summary clearly explains the bug caused by the original change (used after free issue during reboot).
  • Rationale for revert: The justification for reverting is well-explained (minimal performance gain vs. introduced bug).
  • Impact is concise: Clearly states the PR is a bug fix.
  • Testing mentioned: Mentions local testing, which is a good start.

Areas for Improvement:

  • Missing Technical Details:
    • Functional Change: The summary mentions reverting file pre-allocation but lacks specifics on what code is being reverted and how the file handling mechanism will work after the revert.
    • How Revert Works: A brief explanation of how reverting solves the "used after free" issue would be beneficial (e.g., "Reverting ensures that file handles are released correctly during the shutdown process...").
  • Incomplete Testing Information:
    • No Build Host Details: Specify the operating system, CPU architecture, and compiler used for local testing.
    • No Target Details: Provide information about the target architecture and board configuration used for testing.
    • Missing Logs: The "Testing logs before change" and "Testing logs after change" sections are empty. Include relevant log snippets demonstrating the bug before the revert and the fix afterward.

Revised Summary (Example):

This PR reverts commit 086834aae1722af2797fd5d4d0f79226ac11c2e9, which introduced pre-allocated task files to improve performance. However, this change caused a "used after free" issue during reboot due to the way pre-allocated files and their associated groups are released.

This revert removes the pre-allocation mechanism, reverting to the original file handling logic. This ensures that file handles are released correctly during shutdown, preventing the "used after free" error.

Revised Testing (Example):

Build Host(s): Linux (Ubuntu 20.04), x86_64, GCC 9.4.0 Target(s): SIM, qemu-system-arm

Testing logs before change:

[Relevant log output showing the "used after free" error during reboot]

Testing logs after change:

[Relevant log output showing the reboot process completing successfully without errors]

nuttxpr avatar Oct 20 '24 15:10 nuttxpr

And the original prefile improvement was only 3 microseconds, which is very minimal, so reverting it is the best approach.

I think, the main argument for the reverted commit (d29748258bc08f2ab285a6537281f203d89ad4b3) is functional safety requirements, not time consumption.

raiden00pl avatar Oct 20 '24 15:10 raiden00pl

@xiaoxiang781216

  1. The optimized time here is 30% (10.65(us) -> 7.35(us)), and the performance of CPUs with different frequencies will be different
  2. This PR https://github.com/apache/nuttx/pull/11850 is mainly to prevent the kernel code to access the memory allocator during task_create/task_terminate, which is part of functional safety requirements. (Our internal branch could disable CONFIG_SIGNAL and CONFIG_MM in nuttx kernel space)
  3. Please tackle the issue head on, this is probably a regression introduced by CONFIG_FS_REFCOUNT, https://github.com/apache/nuttx/pull/13296

https://github.com/apache/nuttx/pull/11850 was merged earlier

anchao avatar Oct 21 '24 00:10 anchao

@xiaoxiang781216

  1. The optimized time here is 30% (10.65(us) -> 7.35(us)), and the performance of CPUs with different frequencies will be different
  2. This PR fs/inode: add pre-allocated task files to avoid allocator access #11850 is mainly to prevent the kernel code to access the memory allocator during task_create/task_terminate, which is part of functional safety requirements. (Our internal branch could disable CONFIG_SIGNAL and CONFIG_MM in nuttx kernel space)
  3. Please tackle the issue head on, this is probably a regression introduced by CONFIG_FS_REFCOUNT, filep Reference count #13296

#11850 was merged earlier

@anchao If the used file descriptor exceeds CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, it will still invoke file_extend and then call the allocator. How can this situation be resolved? By adjusting CONFIG_NFILE_DESCRIPTORS_PER_BLOCK?

This is an issue related to filelist references, not file references.

Can this functionality be made configurable with a default setting to turn it off? Or are there any other good solutions to this problem?

Donny9 avatar Oct 21 '24 02:10 Donny9

If the used file descriptor exceeds CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, it will still invoke file_extend and then call the allocator. How can this situation be resolved? By adjusting CONFIG_NFILE_DESCRIPTORS_PER_BLOCK?

We will ensure that the application code does not exceed CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, so file_extend() is useless.

This is an issue related to filelist references, not file references. Can this functionality be made configurable with a default setting to turn it off? Or are there any other good solutions to this problem?

No, static reservation is a better solution. As long as the nuttx task is created, it needs 3 handles for standard input 、output、 error(0,1,2),So pre-reserve for the file list is a certain event and cannot be skipped, so why not make it static reservation?

anchao avatar Oct 21 '24 02:10 anchao

If the used file descriptor exceeds CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, it will still invoke file_extend and then call the allocator. How can this situation be resolved? By adjusting CONFIG_NFILE_DESCRIPTORS_PER_BLOCK?

We will ensure that the application code does not exceed CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, so file_extend() is useless.

This is an issue related to filelist references, not file references. Can this functionality be made configurable with a default setting to turn it off? Or are there any other good solutions to this problem?

No, static reservation is a better solution. As long as the nuttx task is created, it needs 3 handles for standard input 、output、 error(0,1,2),So pre-reserve for the file list is a certain event and cannot be skipped, so why not make it static reservation?

If that's the case, tasks using a static file list shouldn't exit, now using such a static file list for tasks that need to exit doesn't make sense. However, given the current requirement to enforce the use of such static lists, it would be best if there were options available.

Donny9 avatar Oct 21 '24 04:10 Donny9

If that's the case, tasks using a static file list shouldn't exit, now using such a static file list for tasks that need to exit doesn't make sense. However, given the current requirement to enforce the use of such static lists, it would be best if there were options available.

The static file list could be exited and closed, but the reserved static buffer should not be freed. is it be possible to calculate whether the current filep is in the static file list to decide whether to release the file list?

anchao avatar Oct 21 '24 04:10 anchao

2. This PR [fs/inode: add pre-allocated task files to avoid allocator access #11850](https://github.com/apache/nuttx/pull/11850)  is mainly to prevent the kernel code to access the memory allocator during task_create/task_terminate, which is part of functional safety requirements. (Our internal branch could disable CONFIG_SIGNAL and CONFIG_MM in nuttx kernel space)

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

yamt avatar Oct 21 '24 04:10 yamt

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

  1. We haven't fully passed ISO26262 yet, so most of the changes are not suitable for upstream.
  2. All task creation will be bound to fd 0/1/2, why do we need to allocate for additional memory segment? most kernel tasks will not needs additional file descriptors.

anchao avatar Oct 21 '24 04:10 anchao

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

1. We haven't fully passed ISO26262 yet, so most of the changes are not suitable for upstream.

i don't understand how ISO26262 is related here. but i guess you meant "not ready for upstream yet, but maybe in future", right? if it isn't ready, maybe it's an option to keep this in your internal branch along with the rest of the necessary changes for now?

2. All task creation will be bound to fd 0/1/2, why do we need to allocate for additional memory segment?

it seems to have an obvious downside.

  • waste memory when the file lists are extended
  • waste memory if CONFIG_NFILE_DESCRIPTORS_PER_BLOCK is too big

that is, when the workload on the system is not static as yours.

also, the associated code/structural complexity (which might caused the crash @Donny9 was observing, i dunno) can be considered as another downside.

most kernel tasks will not needs additional file descriptors.

while i agree about the kernel threads, this stuff is not limited to kernel threads, is it?

(well, actually i personally tend to think kernel threads should not use file descriptors at all. but i guess it's a separate topic.)

yamt avatar Oct 21 '24 05:10 yamt

it seems to have an obvious downside.

  • waste memory when the file lists are extended
  • waste memory if CONFIG_NFILE_DESCRIPTORS_PER_BLOCK is too big

that is, when the workload on the system is not static as yours.

I think you should read this part of the code further. PR #11850 just replace dynamic allocation to static definition. It does not waste any memory.

also, the associated code/structural complexity (which might caused the crash @Donny9 was observing, i dunno) can be considered as another downside.

This flaw should be solved in a more elegant fix instead of revert.

anchao avatar Oct 21 '24 05:10 anchao

it seems to have an obvious downside.

  • waste memory when the file lists are extended
  • waste memory if CONFIG_NFILE_DESCRIPTORS_PER_BLOCK is too big

that is, when the workload on the system is not static as yours.

I think you should read this part of the code further. PR #11850 just replace dynamic allocation to static definition. It does not waste any memory.

it makes "struct filelist" larger, which consumes memory.

yamt avatar Oct 21 '24 05:10 yamt

it makes "struct filelist" larger, which consumes memory.

But task_create will also request the same memory, I just moved it to tcb->group by static.

anchao avatar Oct 21 '24 06:10 anchao

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project. NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

BTW. dynamic files allocation is a relatively new feature (Apache era) which was also not fully discussed during its introduction and at that time introduced a difficult-to-detect bug (https://github.com/apache/nuttx/issues/6012)

raiden00pl avatar Oct 21 '24 07:10 raiden00pl

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project. NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

i just asked what "functional safety requirements" is because i had no idea what it is. i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

yamt avatar Oct 21 '24 07:10 yamt

it makes "struct filelist" larger, which consumes memory.

But task_create will also request the same memory, I just moved it to tcb->group by static.

if all tasks have a few open files, maybe. however, isn't it allowed to have tasks w/o open files?

yamt avatar Oct 21 '24 07:10 yamt

As the convention, we could hold the revert for several days if someone could provide a fix, otherwise the revert has to been merged.

xiaoxiang781216 avatar Oct 21 '24 07:10 xiaoxiang781216

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project. NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

i just asked what "functional safety requirements" is because i had no idea what it is. i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

is this what people talking about here? https://www.iso.org/obp/ui/en/#iso:std:iso:26262:-1:ed-2:v1:en

yamt avatar Oct 21 '24 07:10 yamt

i just asked what "functional safety requirements" is because i had no idea what it is. i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

OK, that make sense :) Functional safety is a requirement in some applications (most obviously automotive) to avoid critical faults that can cause fatal results (like a system error causing the user to die). From a dev's point of view, this often comes down to eliminating dynamic allocation during system runtime (so we never use free), or eliminating it completely (in the case of NuttX, only the first option is probably possible).

Functional safety opens up the possibility for NuttX to be used in many other industries

raiden00pl avatar Oct 21 '24 07:10 raiden00pl

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project. NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

i just asked what "functional safety requirements" is because i had no idea what it is. i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

is this what people talking about here? https://www.iso.org/obp/ui/en/#iso:std:iso:26262:-1:ed-2:v1:en

yes, ISO 26262 is an automotive-specific standard, more general standard is IEC 61508

raiden00pl avatar Oct 21 '24 07:10 raiden00pl

i just asked what "functional safety requirements" is because i had no idea what it is. i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

OK, that make sense :) Functional safety is a requirement in some applications (most obviously automotive) to avoid critical faults that can cause fatal results (like a system error causing the user to die). From a dev's point of view, this often comes down to eliminating dynamic allocation during system runtime (so we never use free), or eliminating it completely (in the case of NuttX, only the first option is probably possible).

Functional safety opens up the possibility for NuttX to be used in many other industries

ok. thank you for explanation.

yamt avatar Oct 21 '24 08:10 yamt