nuttx icon indicating copy to clipboard operation
nuttx copied to clipboard

SMP: fix crash when switch to new task which is still running

Open GUIDINGLI opened this issue 2 years ago • 8 comments

Summary

SMP: fix crash when switch to new task which is still running

Situation:

Assume we have 2 cpus, and busy run task0.

CPU0 CPU1 task0 -> task1 task2 -> task0

  1. remove task0 form runninglist
  2. take task1 as new tcb
  3. add task0 to blocklist
  4. clear spinlock 4.1 remove task2 form runninglist 4.2 take task0 as new tcb 4.3 add task2 to blocklist 4.4 use svc ISR swith to task0 4.5 crash
  5. use svc ISR swith to task1

Fix: Move clear spinlock to the end of svc ISR

Signed-off-by: ligd [email protected]

Impact

SMP

Testing

VELA

GUIDINGLI avatar Aug 11 '22 10:08 GUIDINGLI

@masayuki2009 Could you help to review & test this ?

GUIDINGLI avatar Aug 11 '22 10:08 GUIDINGLI

@GUIDINGLI

Let me try this PR later.

masayuki2009 avatar Aug 11 '22 10:08 masayuki2009

@GUIDINGLI It seems that this PR affects all CPU architectures which support SMP. However, you modified ARM only. Please test this PR with all CPU architectures including sim.

masayuki2009 avatar Aug 11 '22 10:08 masayuki2009

@GUIDINGLI Also, can you provide the test code so that we can reproduce the bug?

masayuki2009 avatar Aug 11 '22 10:08 masayuki2009

@GUIDINGLI Also, can you provide the test code so that we can reproduce the bug?

start a hello world main like this. while(1);

just busy loop one cpu, then do anything in other thread.

GUIDINGLI avatar Aug 11 '22 13:08 GUIDINGLI

@GUIDINGLI Also, can you provide the test code so that we can reproduce the bug?

start a hello world main like this. while(1);

just busy loop one cpu, then do anything in other thread.

@GUIDINGLI

I tried your test case with both sabre-6quad:smp (CONFIG_SMP_NCPUS=2) and rv-virt:smp64 (CONFIG_SMP_NCPUS=2) on qemu-6.2 but I couldn't reproduce the bug.

Can you reproduce the issue with the above environments?

masayuki2009 avatar Aug 12 '22 02:08 masayuki2009

ALL the arch.

GUIDINGLI avatar Aug 12 '22 14:08 GUIDINGLI

ALL the arch.

@GUIDINGLI Hmm, it seems that sim:smp stops with DEBUGASSERT

$ ./nuttx 
[CPU0] up_assert: Assertion failed CPU0 at file:irq/irq_csection.c line: 348 task: loop_task
[CPU0] up_assert: Assertion failed CPU0 at file:irq/irq_csection.c line: 348 task: loop_task
[CPU0] up_assert: Assertion failed CPU0 at file:irq/irq_csection.c line: 348 task: loop_task
[CPU0] up_assert: Assertion failed CPU0 at file:irq/irq_csection.c line: 348 task: loop_task
[CPU0] up_assert: Assertion failed CPU0 at file:irq/irq_csection.c line: 348 task: loop_task

masayuki2009 avatar Aug 12 '22 23:08 masayuki2009

@GUIDINGLI

Can you attach both defconfig and .config that you can reproduce the issue?

masayuki2009 avatar Aug 16 '22 23:08 masayuki2009

I meet another problem related to this, I'm work on it.

GUIDINGLI avatar Aug 17 '22 04:08 GUIDINGLI

@GUIDINGLI do we need to convert this change to draft while you are investigating the new issue you met?

pkarashchenko avatar Aug 18 '22 18:08 pkarashchenko

This should be test base on: https://github.com/apache/incubator-nuttx/pull/7015

GUIDINGLI avatar Sep 06 '22 13:09 GUIDINGLI

@masayuki2009 please help to review

GUIDINGLI avatar Sep 06 '22 13:09 GUIDINGLI

@masayuki2009 could you test with your hardware?

xiaoxiang781216 avatar Sep 07 '22 04:09 xiaoxiang781216

@GUIDINGLI Can you reproduce the crash with QEMU?

masayuki2009 avatar Sep 07 '22 04:09 masayuki2009

Can you reproduce the crash with QEMU?

@GUIDINGLI Also, please rebase this PR to the latest master.

masayuki2009 avatar Sep 07 '22 05:09 masayuki2009

Can you reproduce the crash with QEMU?

@GUIDINGLI Also, please rebase this PR to the latest master.

done

GUIDINGLI avatar Sep 07 '22 09:09 GUIDINGLI

@GUIDINGLI Can you reproduce the crash with QEMU?

I have tried on sabre-6quad:netnsh_smp, can't reproduce this bug. And tried set NCPUS=2, close RR_INTERVAL, set CONFIG_SYSTEM_NSH_PRIORITY=255, But can't reproduce.

GUIDINGLI avatar Sep 07 '22 14:09 GUIDINGLI

I have tried on sabre-6quad:netnsh_smp, can't reproduce this bug. And tried set NCPUS=2, close RR_INTERVAL, set CONFIG_SYSTEM_NSH_PRIORITY=255, But can't reproduce.

@GUIDINGLI Hmm, is it true that you can still reproduce the issue with VELA? If so, please attach both of defconfig and .config so that we can look into them.

Also, I noticed that qemu-a53:nsh_smp does not work with this PR.

masayuki2009 avatar Sep 07 '22 22:09 masayuki2009

qemu-a53:nsh_smp has been fixed.

Also correct the risc-v & mips arch.

is it true that you can still reproduce the issue with VELA? -I will do the double check.

GUIDINGLI avatar Sep 08 '22 14:09 GUIDINGLI

config.txt

@masayuki2009 here is the .config

GUIDINGLI avatar Sep 09 '22 08:09 GUIDINGLI

@GUIDINGLI Currently, armv8-m and mips32 do not support SMP. So I think the following file should not be changed.

arch/arm/src/armv8-m/arm_doirq.c arch/mips/src/mips32/mips_doirq.c arch/mips/src/pic32mx/pic32mx_decodeirq.c arch/mips/src/pic32mz/pic32mz_decodeirq.c

masayuki2009 avatar Sep 12 '22 00:09 masayuki2009

Current not support, maybe support in future. So it is better modified all.

GUIDINGLI avatar Sep 13 '22 04:09 GUIDINGLI

Current not support, maybe support in future. So it is better modified all.

@GUIDINGLI If so, how about other CPU architectures such as x86_64 and sparc? In my personal opinion, this PR should focus on the currently available SMP platforms.

masayuki2009 avatar Sep 13 '22 05:09 masayuki2009

let's add all arch? Otherwise people may forget to apply this change to other arch if they implement SMP for that arch someday.

xiaoxiang781216 avatar Sep 13 '22 15:09 xiaoxiang781216

@masayuki2009 @xiaoxiang781216 Now we only need modify the arch which already support SMP. So, I remove the mips & add ceva.

GUIDINGLI avatar Sep 16 '22 16:09 GUIDINGLI

LGTM, let's ignore the false alarm for macOS CI.

xiaoxiang781216 avatar Sep 16 '22 17:09 xiaoxiang781216