tinyusb icon indicating copy to clipboard operation
tinyusb copied to clipboard

nrf5x: Use stdatomic to fix race conditions

Open kasjer opened this issue 2 years ago • 8 comments

Current code did not prevent race condition if transfer was started from non-usb interrupt. testing and modifying dma_running flag was not nested-interrupt-proof.

Describe the PR This uses standard C library atomic functions to access dma_running flag. For Cortex-M4 (NRF52) and Cortex-M33 (NRF53) compiler generates code that uses LDREX/STREX instructions along with memory barriers.

Additional context Issue was detected during BLE throughput tests that also stressed USB. It happens when large number of USB OUT packets mixes with event USB IN interrupt transfers. USB IN transfers are started in other (non-USB interrupt) and sometimes it happens that those interrupts preempt USB interrupt in the middle of USB OUT transfer handling.

kasjer avatar Mar 22 '22 18:03 kasjer

(I'm just curious why this fix was needed in the first place- it looks fine to me)

What was the code generated for setting/clearing dma_running originally, and why did it cause a race? Does a non-atomic bool on ARM require more than one insn (one to generate the constant, another to do the store)?

cr1901 avatar Mar 23 '22 14:03 cr1901

@cr1901 thanks for asking now that I looked again there is just one line to be removed and then it will be clear. Problem with current code is that place that checks whether dma is currently running is separated from place that sets this flag. It takes some measures to ensure that change and check is consistent but when we have two interrupts involved checking in interrupt that dma is not running is not enough since after check another interrupt can check again if dma is running and can see that it is not hence it will start DMA and as soon as interrupt is finished previous interrupt that checked flag start DMA again causing unpredictable behavior.

Proposed change tries to set dma flag only if it was not set. If this fail it means that dma is running. LDREX/STREX behind atomic functions will ensure that even if interrupt fires in between checking and setting everything will work as expected.

I will push one more commit that will show were flag was set and then it will be more clear why this has to updated

kasjer avatar Mar 23 '22 14:03 kasjer

@kasjer Thank you for the clarification, now it makes sense :D!

cr1901 avatar Mar 23 '22 15:03 cr1901

thank @kasjer very much, I actually tried something similar previously with exactly atomic_bool, and expect compiler to generate LDREX/STREX. But somehow I failed mysteriously :face_with_head_bandage: and didn't have the time to investigate. Since I really need to fix the race and switch works. I am glad that finally you could address this, could you share how to do the ble throughput test so that I could help

PS: hmm, last time I try to use atomic to also replace mutex as well, do you think we could remove mutex now. I kind of forgot how these works temporarily.

hathach avatar Mar 29 '22 11:03 hathach

@kasjer Thank you for the clarification, now it makes sense :D!

LDREX/STREX on M3+ is noisr-free to achieve mutex on multiple cores or preempted RTOS. Other methods are often resourced to disable_isr() to get mutex. Note atomic bool on unsupported MCU such as M0 will cause disable_isr(). But what do I know, I failed miserably last time I tried to apply it exactly this :smile: #662 for more reference

hathach avatar Mar 29 '22 11:03 hathach

@hathach I think that it's possible that edpt_dma_start() could be reduced to

static void edpt_dma_start(volatile uint32_t* reg_startep)
{
    if (atomic_flag_test_and_set(&_dcd.dma_running))
    {
      //use usbd task to defer later
      usbd_defer_func((osal_task_func_t) edpt_dma_start, (void*) (uintptr_t) reg_startep, true);
    }else
    {
      start_dma(reg_startep);
    }
}

...deferring DMA start in every case that DMA seems to be still running. There was still another race when following sequence was interrupted

  xfer->buffer     = buffer;
  xfer->total_len  = total_bytes;
  xfer->actual_len = 0;

and total_len and actual_len were inconsistent in interrupt handler hence current interrupt disable during this and another mutex lock to make sure that interrupt is not enabled by other task.

If you plan changes keep in mind those scenarios. If I will have time I can also check path but I will have to wait a bit.

@hathach Currently testing application is property of our client and is not based on bth_device but on vendor interface. Before vendor interface was working we did initial implementation on bth_device. We do have permission to have testing framework open sourced in regard to bth_device (not the vendor part). Once we have clear separation of what can be shared and what not we will make it public. I will let you know when this happens so you can use it for testing NRF52840, NRF5340 and DA1469x.

kasjer avatar Mar 29 '22 13:03 kasjer

@kasjer I remembered doing exactly just that and also apply for the thread-mode case (below) as well , but it did not work out.

https://github.com/hathach/tinyusb/blob/edd8eb3279c2440e9d4590312f2104e58beafe12/src/portable/nordic/nrf5x/dcd_nrf5x.c#L156-L165

However, there are a few race fixed since then. It could have been caused by other bugs and/or I made my own mistakes. I would love to have the tests, however, if it is too troublesome, don't worry I figure out my way by pulling out one of previous race tests that we fixed.

...deferring DMA start in every case that DMA seems to be still running. There was still another race when following sequence was interrupted

  xfer->buffer     = buffer;
  xfer->total_len  = total_bytes;
  xfer->actual_len = 0;

and total_len and actual_len were inconsistent in interrupt handler hence current interrupt disable during this and another mutex lock to make sure that interrupt is not enabled by other task.

If you plan changes keep in mind those scenarios. If I will have time I can also check path but I will have to wait a bit.

Yeah, thanks for pointing this out, this is probably the issue I encountered previously. Hopefully we could finally fix this race one and for all

hathach avatar Mar 29 '22 14:03 hathach

@hathach It took a while but we now have opensource throughput test application for nimble that can be also used to with USB transport for NRF52840/NRF5340/DA1469x.

With this software it's possible to trigger race condition that this PR fixes. Testing application is part of nimble and is written in python and it runs on Linux.

I was able to reproduce problem only on NRF5340 where bluetooth stack runs on separate core.

This is how to test: get go compiler if not installed

sudo apt install golang

get latest newt tool from repository, current release version can have some flows.

git clone [email protected]:apache/mynewt-newt.git
cd mynewt-newt
./build.sh
cd ..

use tool produce by step above to create mynewt project

mynewt-newt/newt/newt new my_project
cd my_project

add TinyUSB repository to the project and switch to master branches (tagged version do not contain everything)

cat >project.yml <<END
project.name: "my_project"

project.repositories:
    - apache-mynewt-core

repository.apache-mynewt-core:
    type: github
    vers: 0.0.0
    user: apache
    repo: mynewt-core

repository.tinyusb:
    type: github
    vers: 0.0.0
    user: hathach
    repo: tinyusb
END
# following line should not be needed but due to some errors tinyusb is cloned separately
git clone --depth=1 [email protected]:hathach/tinyusb.git repos/tinyusb
../mynewt-newt/newt/newt upgrade

Nimble do have targets.

../mynewt-newt/newt/newt build -q @apache-mynewt-nimble/tools/hci_throughput/targets/nordic_pca10095_app_boot
../mynewt-newt/newt/newt build -q @apache-mynewt-nimble/tools/hci_throughput/targets/nordic_pca10095_net_boot
../mynewt-newt/newt/newt create-image -q @apache-mynewt-nimble/tools/hci_throughput/targets/nordic_pca10095_app_blehci timestamp

Load binaries to the board

../mynewt-newt/newt/newt load @apache-mynewt-nimble/tools/hci_throughput/targets/nordic_pca10095_app_boot
../mynewt-newt/newt/newt load @apache-mynewt-nimble/tools/hci_throughput/targets/nordic_pca10095_app_blehci
../mynewt-newt/newt/newt load @apache-mynewt-nimble/tools/hci_throughput/targets/nordic_pca10095_net_boot

Once build and connected to Linux board should be visible as USB Bluetooth controller. To verify that everything went correct hciconfig command can be used

hciconfig

that should have output like this. In this example hci0 is existing bluetooth controller from PC, hci1 is newly connected NRF5340. It PC does not have any bluetooth controller it is possible to connect two NRF5340 or one NRF52840 board.

hci1:	Type: Primary  Bus: USB
	BD Address: 00:00:00:00:00:00  ACL MTU: 255:80  SCO MTU: 0:0
	DOWN 
	RX bytes:5149 acl:0 sco:0 events:22 errors:0
	TX bytes:537475 acl:2108 sco:0 commands:34 errors:1

hci0:	Type: Primary  Bus: USB
	BD Address: 8C:C6:81:69:59:D7  ACL MTU: 1021:4  SCO MTU: 96:6
	DOWN 
	RX bytes:56663963 acl:0 sco:0 events:3288 errors:0
	TX bytes:770690 acl:0 sco:0 commands:3238 errors:0

Make sure python requirements are met

cd repos/apache-mynewt-nimble/tools/hci_throughput
sudo pip install -r requirements.txt

Running throughput testes python application. To trigger race NRF5340 must be transmitting a lot of ACL data. In line below 0 and 1 mean hci0 and hci1 and rx tx mean that hci0 will be receiving and hci1 will be transmitting. Tool requires config.yaml file. There is config.yaml.sample that while is fine may run test to short to trigger problem, hence awk line that increase number of packets to send.

awk <config.yaml.sample -e '
/num_of_packets_to_send/ { print "num_of_packets_to_send: 100000"; next }
{ print }
' >config.yaml
sudo python3 main.py -i 0 1 -m rx tx -cf config.yaml

If everything worked fine btmon can be used to see BLE traffic

sudo btmon

Without changes in this PR it's very likely that device will reboot due to assert and then traffic will stop. Then hci1 will vanish and hci2 will pop up due to hci1 device being used by python.

I hope that will convince you.

Then we have #1474 that needs another change and more complicated setup but it can be reproduced on NRF52840 as well.

kasjer avatar May 27 '22 16:05 kasjer

superseded by #1489 . Thank you very much for the PR and sorry for late response.

hathach avatar Aug 15 '22 14:08 hathach