tinyusb icon indicating copy to clipboard operation
tinyusb copied to clipboard

DWC2 DMA support

Open HiFiPhile opened this issue 10 months ago • 21 comments

Describe the PR Add Internal DMA support to DWC2.

Status

Model Core Rev Enumeration cdc_ual_ports audio_4_channel_mic video_capture_2ch
STM32F407
FS w/wo DMA
2.81a :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
STM32F429
FS w/wo DMA
2.81a :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
STM32F723
FS + HS w/wo DMA
3.30a :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
STM32U5
HS w/wo DMA
4.11a :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:
Model Core Rev Enumeration cdc_msc_freertos audio_4_channel_mic_freertos hid_composite_freertos
ESP32-S3
FS w/wo DMA
4.00a :heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:

Benchmark

STM32F723E-DISCO IAR 9.50 High-Balanced I-Cache enabled

Method used in #920, all buffers set to 2048b.

Throughput Total CPU Usage
DMA OFF 26.7 MB/s 55%
DMA ON 29.5 MB/s 26%

HiFiPhile avatar Apr 05 '24 20:04 HiFiPhile

Setup is received, somehow Data packet ending is corrupted...

USBD init on controller 1
sizeof(usbd_device_t) = 44
sizeof(dcd_event_t) = 12
sizeof(tu_fifo_t) = 12
sizeof(tu_edpt_stream_t) = 24
CDC init
guid, gsnpsid, ghwcfg1, ghwcfg2, ghwcfg3, ghwcfg4
0x00001100, 0x4F54281A, 0x00000000, 0x229ED590, 0x03F403E8, 0x17F00030
Fullspeed PHY init
USBD Bus Reset : Full Speed

USBD Setup Received 80 06 00 01 00 00 40 00
  Get Descriptor Device
  Queue EP 80 with 18 bytes ...
  0000:  12 01 00 02 EF 02 01 40 FE CA 02 40 00 01 01 02  |.......@...@....|
  0010:  03 01                                            |..|

image

HiFiPhile avatar Apr 05 '24 20:04 HiFiPhile

I've tried this PR in my project directly. I'm using an stm32f722ze, OTG_HS peripheral with internal FS phy, a single CDC interface with nothing of note in terms of config. ep0 is 64B, CDC bulk are also 64B. I'm using GCC12.3 with -O0 and only ICACHE enabled.

Unfortunately, as is, the PR did not work. During enumeration a broken ep0 xfer corrupts a big chunk of the sram through the dma controller. I am also using the MPU, so I know it's not the CPU that does this: image

Please let me know if there is anything I can provide in terms of logs that could help in pinpointing why this is happening.

leptun avatar May 05 '24 14:05 leptun

I tried catching the moment where the xfer length gets corrupted. I added this. It's probably not correct, however the CDC does seem to work correctly now and the USB enumerates properly. image

leptun avatar May 05 '24 14:05 leptun

I tried catching the moment where the xfer length gets corrupted. I added this. It's probably not correct, however the CDC does seem to work correctly now and the USB enumerates properly.

Humm that's interesting... Just checked ghwcfg registers they are the same as F723, so the only difference is F723 has an internal HS phy.

However corrupted setup packet shouldn't be caused by this, could you publish full project ?

HiFiPhile avatar May 05 '24 14:05 HiFiPhile

The epout interrupt seems to be triggering twice. Example:

with this PR:

USBD Setup Received 80 06 00 02 00 00 FF 00 
  Get Descriptor Configuration[0]
  Queue EP 80 with 64 bytes ...
USBD Xfer Complete on EP 80 with 64 bytes
  Queue EP 80 with 11 bytes ...
USBD Xfer Complete on EP 80 with 11 bytes
  Queue EP 00 with 0 bytes ...
USBD Xfer Complete on EP 00 with 0 bytes
USBD Xfer Complete on EP 00 with 0 bytes

with latest master:

USBD Setup Received 80 06 00 02 00 00 FF 00 
  Get Descriptor Configuration[0]
  Queue EP 80 with 64 bytes ...
USBD Xfer Complete on EP 80 with 64 bytes
  Queue EP 80 with 11 bytes ...
USBD Xfer Complete on EP 80 with 11 bytes
  Queue EP 00 with 0 bytes ...
USBD Xfer Complete on EP 00 with 0 bytes

the underflowed xfer length would make sense if the subtract is done twice.

And yes, I can make the project public. I'll get that sorted soon.

leptun avatar May 05 '24 14:05 leptun

The epout interrupt seems to be triggering twice

For all EPs or only EP0 ?

HiFiPhile avatar May 05 '24 15:05 HiFiPhile

I think it's only EP0 that does this, but I'm not 100% sure. Here is a complete log:

dwc2_dma.txt

leptun avatar May 05 '24 15:05 leptun

@HiFiPhile https://github.com/leptun/Avionus_Firmware/tree/tusb_dma Here is the code that I am using. I've created a branch where I've disabled pretty much anything that is unrelated to usb. You should be able to get the log on the SWO trace. I am also initializing almost no peripheral, so you can probably run it on the stm32f723 eval board. It's using the HSE at 24MHz. You might have to change that in App/config.hpp. Let me know if this also triggers the bug for you. To build the project you need to initialize the git submodules. I'm using STM32cubeide 1.15.0 with the GNU Tools for STM32 (11.3.rel1) toolchain. I was mistaken in my previous post when I said I use GCC 12.3. Do not regenerate the cubemx code since that will enable stuff again.

leptun avatar May 05 '24 15:05 leptun

Just a quick test, what if you put tusb memory in DTCM in linker script ?

I am also initializing almost no peripheral, so you can probably run it on the stm32f723 eval board.

Unfortunately not, USB phy structures are completely different.

HiFiPhile avatar May 05 '24 16:05 HiFiPhile

I'll try that when I get back home later today. Do you want everything tinyusb related in dtcmram?

leptun avatar May 05 '24 16:05 leptun

Do you want everything tinyusb related in dtcmram?

Yes please

HiFiPhile avatar May 05 '24 16:05 HiFiPhile

Just tried. I moved the krpc (the thing using the CDC) and tinyusb at the beginning of DTCMRAM. Still same behaviour. I pushed the changes.

leptun avatar May 05 '24 16:05 leptun

Humm.. seems out of RAM ,as HS needs more buffer:

C:/ST/STM32CubeIDE_1.15.1/STM32CubeIDE/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.11.3.rel1.win32_1.1.200.202402032213/tools/bin/../lib/gcc/arm-none-eabi/11.3.1/../../../../arm-none-eabi/bin/ld.exe:E:\mcu\stm32f7\Avionus_Firmware\STM32F722ZETX_FLASH.ld:370 cannot move location counter backwards (from 000000002003ec68 to 000000002003e800)
collect2.exe: error: ld returned 1 exit status
make: *** [makefile:155: Avionus_Firmware.elf] Error 1

HiFiPhile avatar May 05 '24 17:05 HiFiPhile

Oh. Yes. I allocated only 2KB to tinyusb. I've pushed a commit that increases the size to 4KB. If you need more than that the linker script needs to be reordered (swap the position of the krpc section and the tinyusb section since the memory segments need to be also aligned to their size.)

leptun avatar May 05 '24 18:05 leptun

Just tested the fix in 02ec486 and it solved the issue on my end as well 👍 Also verified the dual-CDC example on a stm32f746-disco board with the HS ULPI phy and that enumerates as well

leptun avatar May 05 '24 20:05 leptun

Does it resolve first all zero setup packet ? (This one is only on your project)

Can't get SWO to work so I've added RTT to your project, and suddenly RTT stop to output...

Maybe there is still some glitch somewhere, enumerate is not smooth as stock example, it start with one incomplete transfer. image

HiFiPhile avatar May 05 '24 20:05 HiFiPhile

Huh, I did notice it enumerated slower, but I thought that was just my PC acting up since I have tons of endpoints active on the same root hub. But you are right, there are more bus resets than there should be. I don't think I noticed this behaviour on the disco with the HS phy. This is what it looks like now on the logs: dwc2_dma.txt Weird that SWO didn't work.

leptun avatar May 05 '24 20:05 leptun

Ha ! Endpoint transfer is queued before it's opened, seems in DMA mode it's more sensitive.

Maybe we will add more assertion.

 krpc_error_t KrpcClient::krpc_close() {
-	tud_cdc_n_read_flush(0);
-	tud_cdc_n_write_clear(0);
+	if(tud_ready()){
+		tud_cdc_n_read_flush(0);
+		tud_cdc_n_write_clear(0);
+	}
 	return KRPC_OK;
 }

HiFiPhile avatar May 05 '24 20:05 HiFiPhile

Heh, nice catch! Some assertion for this scenario might be helpful, though I'm not sure if this should go in the CDC class or in the usbd driver. Thanks a lot for helping out with getting this PR working with my project. If I notice any other issue with it, I'll let you know :)

Now I guess it's time for me to go back to finding out why the performance of stm32_fsdev is attrocious. Might be related to the fact that the data is copied around 3 or 4 times by the CPU, might be because double buffering is not implemented for bulk endpoints... If you're curious what that is for, I am working on an AVR ISP programmer with integrated USB-UART as a second virtual com port. The TTL part seemed to be going well, till I realized that I couldn't push more than ~70KB/s of CDC data in a loopback on the same interface. I tried using the xfer_ff in the CDC so that one of the copies is avoided, but that didn't help at all. I really hope I won't have to get double buffering working.

leptun avatar May 05 '24 21:05 leptun

I realized that I couldn't push more than ~70KB/s of CDC data in a loopback on the same interface. I tried using the xfer_ff in the CDC so that one of the copies is avoided, but that didn't help at all. I really hope I won't have to get double buffering working.

That's really bad, I got more than 500KB/s in CDC OUT transfer with a G0B1, which makes me think double buffer is not necessary. Besides it's really a pain to manage double buffer in fsdev.

Let's continue in a discussion.

HiFiPhile avatar May 05 '24 22:05 HiFiPhile

Thanks, but no need anymore. My testing setup was bad. With your set_test.py script and some modifications to also rx at the same time, I can now loopback at ~560KB/s, which is more than enough

leptun avatar May 05 '24 22:05 leptun

Hi @HiFiPhile , May I ask how did you measure the total CPU load?

roma-jam avatar Sep 18 '24 10:09 roma-jam

thank you very much @HiFiPhile again for another great PR. sorry for super late response, I am doing more test with this, seems like it have some issue with my h743 eval which uses external HS PHY. That is ok, I am doing more testing and reading hopefully we could fix and merge this soon.

hathach avatar Sep 18 '24 14:09 hathach

seems like it have some issue with my h743 eval which uses external HS PHY

Thanks for testing, it's the combination I'm missing.

May I ask how did you measure the total CPU load?

@roma-jam In short I'm using GPIO toggle in ISR and in event queue processing, then take the averaged sum of 10 cycles.

HiFiPhile avatar Sep 18 '24 15:09 HiFiPhile