Rewrite of NCM device driver
Describe the PR
This is a rewrite of the original NCM device driver. You can check tests concerning functionality and performance here.
Additional context
Bugginess of the current driver can be easily verified with `for MSS in 100 200 400 800 1200 1450 1500; do iperf -c 192.168.14.1 -e -i 1 -M $MSS -l 8192 -P 1; sleep 2; done`.
I'm using the driver in my own project. The USB device is running together with Linux/Win10 without any problems.
This solves #2068.
Hi, sorry for the huge delay. I'm not an expert on the NCM class, but I'd like to move things along.
Compared to RNDIS, the current NCM implementation is problematic, and the new implementation doesn't perform well at high MSS on Linux (tested with an STM32F723):
MSS vs Mbit/s | RNDIS | NCM with #2564 | NCM New |
---|---|---|---|
100 | 7.25 | 7.04 | 10.2 |
200 | 14.5 | 6.95 | 12.5 |
400 | 21.4 | 6.75 | 6.71 |
800 | 34.8 | 6.97 | 8.02 |
1450 | 34.8 | 8.17 | 8.00 |
On the bus I observed a high delay between the OUT and IN transfer, and also between two OUT transfers, for both NCM and NCM New.
It could also be a Linux tuning issue; on Windows 10 I got a throughput of ~30 Mbps without these high delays.
I'll try to review your implementation; below is the USB capture in case you're interested: ncm_new.zip
> On the bus I observed a high delay between the OUT and IN transfer, and also between two OUT transfers, for both NCM and NCM New. It could also be a Linux tuning issue; on Windows 10 I got a throughput of ~30 Mbps without these high delays.
Hello HiFiPhile
Very interesting observation. Could you do the little benchmark comparison with Win10 as well?
> Very interesting observation. Could you do the little benchmark comparison with Win10 as well?
It seems like iperf on Windows doesn't support MSS adjustment; I see no difference in the captured packets.
On Windows there is only one TCP ACK packet each time instead of two, and the delay between transfers is reduced. I got a throughput of 24 Mbps.
Here is the USB capture: ncm_enum win.zip
One strange thing: on the USB bus the ACK sequence number is shifted relative to the data packet, or maybe it's a decoding issue.
Here is the Netif capture on Linux: netif.zip
Hmmm... don't know what to do here. Unfortunately I don't have any Hi-Speed device in my drawer, so I'm unable to check the above.
At least I have a project using NCM which also works in conjunction with Win10. Speed is comparable. My major concern was compatibility.
I was able to massively boost the performance simply by increasing TCP_WND!
MSS vs Mbps | default | x2 | x4 |
---|---|---|---|
100 | 10.2 | 10.3 | 10.4 |
200 | 12.5 | 18.9 | 19.2 |
400 | 6.71 | 24.9 | 32 |
800 | 8.02 | 12.5 | 46.6 |
1450 | 8.00 | 14.8 | 70.4 |
MSS=1450 comparison:
- TCP_WND = TCP_MSS: instead of one complete datagram, two segments are sent; both are ACKed, then there is a high delay (~1 ms) until the next transfer.
- TCP_WND = 2*TCP_MSS: two complete datagrams are sent; both are ACKed, then a high delay (~1 ms) until the next transfer.
- TCP_WND = 4*TCP_MSS: two complete datagrams are sent with only one ACK; the ACK delay is a little higher, maybe due to processing, but there is no delay until the next transfer.
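For reference, the variants above boil down to a one-line change in lwipopts.h. A minimal sketch (the TCP_MSS value and the TCP_SND_BUF line are assumptions from a typical lwIP setup, not taken from the test configuration):

```c
/* lwipopts.h -- sketch of the three window variants compared above */
#define TCP_MSS       1460               /* assumed typical Ethernet MSS */

/* default: TCP_WND = TCP_MSS        -> small bursts, ~1 ms gaps         */
/* x2:      TCP_WND = (2 * TCP_MSS)  -> two full datagrams, ~1 ms gaps   */
/* x4:      TCP_WND = (4 * TCP_MSS)  -> back-to-back transfers, one ACK  */
#define TCP_WND       (4 * TCP_MSS)

/* lwIP requires TCP_SND_BUF >= 2 * TCP_MSS; growing it with the window
 * helps keep the pipe full on the transmit side. */
#define TCP_SND_BUF   (4 * TCP_MSS)
```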
This makes me believe there must be some relationship between the TCP window and the NTB size to make it work better, so I decreased the NTB size to 1600 while keeping TCP_WND = 2*TCP_MSS:
MSS vs Mbps | default | x2 | x4 | x2; NTB 1600 |
---|---|---|---|---|
100 | 10.2 | 10.3 | 10.4 | 10.7 |
200 | 12.5 | 18.9 | 19.2 | 20.6 |
400 | 6.71 | 24.9 | 32 | 38.4 |
800 | 8.02 | 12.5 | 46.6 | 44.8 |
1450 | 8.00 | 14.8 | 70.4 | 68 |
Two packets are sent and two ACKs received, with no delay until the next transfer.
Below are all USB captures: tcp_wnd_test.zip
Bonus: with #2576 applied, I got 116 Mbps @ MSS=1450, TCP_WND = 4*TCP_MSS!
Wow... those are great numbers. Out of curiosity: did you test different TCP_WND with RNDIS as well?
And the other point: anything I can do on the driver side to enhance throughput?
> Out of curiosity: did you test different TCP_WND with RNDIS as well?
Here is RNDIS on Arch Linux:
MSS vs Mbps | default | x2 | x4 |
---|---|---|---|
100 | 7.48 | 7.54 | 7.74 |
200 | 14.7 | 14.1 | 14.7 |
400 | 25.6 | 27.8 | 27.2 |
800 | 38.8 | 41.8 | 50.9 |
1450 | 38.8 | 65.4 | 71.9 |
At high MSS the throughput is a little higher than in my previous test... so TCP_WND helps RNDIS as well. On Windows 10 there isn't much difference, always ~11 Mbps @ 1450.
> And the other point: anything I can do on the driver side to enhance throughput?
From your understanding, is my finding about TCP_WND vs NTB size coherent? How did you choose the size of 3200?
It looks like TCP_WND=2*TCP_MSS and NTB_SIZE=1600 gives good performance with balanced RAM usage; does it work well in your case?
Meaning that RNDIS benefits only for MSS=800/1450 (and is still faster than NCM in these cases :-/).
Buffer size: I have taken the 3200 from the original TinyUSB NCM driver. So no thoughts from my side concerning buffer size.
Hmmm... currently checking Zephyr sources. They do not transmit frames from device -> host which are larger than NET_ETH_MAX_FRAME_SIZE, which is 1500 + header size.
Not sure if lwIP's iperf server is buggy or something else, but reverse throughput doesn't work well; in most cases it just doesn't transfer. It also looks like it always uses TCP_MSS as the MSS.
MSS vs Mbps | RNDIS | NCM |
---|---|---|
100 | 0 | 0 |
200 | 0 | 0 |
400 | 0 | 0 |
800 | 23 | 16 |
1200 | 23 | 16 |
1450 | 0 | 0 |
NTB is not used efficiently, always cut in half.
Yes, I had that feeling as well, and as far as I remember I found a bug in the iperf sources but was too busy/lazy to push the fix through their merge machinery.
You can give the attached file a try.
Another point concerning buffer/frame sizes: I'm currently checking the Zephyr sources. They do not transmit frames from device -> host which are larger than NET_ETH_MAX_FRAME_SIZE, which is 1500 + header size.
PS: Just found the kernel uses 2048 as the minimal size:
https://github.com/torvalds/linux/blob/f03359bca01bf4372cf2c118cd9a987a5951b1c8/drivers/net/usb/cdc_ncm.c#L161
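For context, the linked headers express that floor roughly as follows (reproduced from memory, so treat the exact names and the clamping detail as approximations and check the links above):

```c
/* include/uapi/linux/usb/cdc.h (approximate, from memory) */
#define USB_CDC_NCM_NTB_MIN_IN_SIZE    2048
#define USB_CDC_NCM_NTB_MIN_OUT_SIZE   2048

/* drivers/net/usb/cdc_ncm.c clamps the NTB sizes negotiated with the device
 * against these minima, so anything smaller is bumped up to 2048 host-side. */
```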
Ah... yes... one more thing: in function_ncm.c there is a definition of up/downlink speed. Never really tried values for tuning but perhaps the Linux NCM driver is smart and checks those values?
> PS: Just found the kernel uses 2048 as the minimal size: ...
Meaning CONFIG_CDC_NCM_XMT/RCV_NTB_MAX_SIZE > 2048?
> Never really tried values for tuning but perhaps the Linux NCM driver is smart and checks those values?
Already tried; I set it to 480 Mbps and nothing changed.
> Meaning CONFIG_CDC_NCM_XMT/RCV_NTB_MAX_SIZE > 2048?
I think so; performance is 2-3% lower than with 1600, as most of the last packet is wasted.
I got a major throughput regression after updating lwIP to master: at MSS=800 it went down from 46 Mbps to 37 Mbps.
I tried your lwiperf but it's not better; instead of reporting 0 bps, iperf just hangs.
Log shows:

```
lwiperf_tcp_poll
...
lwiperf_tcp_poll
lwiperf_tx_start_impl
lwiperf_tcp_client_connected
lwiperf_tcp_client_send_more a
lwiperf_tcp_client_send_more b
lwiperf_tcp_poll
...
lwiperf_tcp_poll
lwiperf_tcp_client_send_more end time
```
I reverted lwIP to 2.1.2. Another finding: reducing `CFG_TUD_NCM_OUT_NTB_N` and `CFG_TUD_NCM_IN_NTB_N` to 1 doesn't affect performance at all and saves a lot of RAM.
To summarize (config sketch below):
- CFG_TUD_NCM_OUT_NTB_N=1
- CFG_TUD_NCM_IN_NTB_N=1
- CFG_TUD_NCM_IN_NTB_MAX_SIZE=2048
- CFG_TUD_NCM_OUT_NTB_MAX_SIZE=2048
- TCP_SND_BUF=(4 * TCP_MSS)
- TCP_WND=(4 * TCP_MSS)
- PBUF_POOL_SIZE=4
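Spelled out, that set splits across the two config headers roughly like this (a sketch of the test setup; the header names for each group and the fact that TCP_MSS is defined elsewhere are assumptions):

```c
/* tusb_config.h -- NCM buffer configuration (sketch) */
#define CFG_TUD_NCM_OUT_NTB_N          1      /* a single RX NTB is enough here */
#define CFG_TUD_NCM_IN_NTB_N           1      /* a single TX NTB is enough here */
#define CFG_TUD_NCM_IN_NTB_MAX_SIZE    2048
#define CFG_TUD_NCM_OUT_NTB_MAX_SIZE   2048

/* lwipopts.h -- TCP/pbuf tuning (TCP_MSS assumed to be defined elsewhere) */
#define TCP_SND_BUF     (4 * TCP_MSS)
#define TCP_WND         (4 * TCP_MSS)
#define PBUF_POOL_SIZE  4
```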
Transmit:
MSS vs Mbps | RNDIS | NCM |
---|---|---|
100 | 7.54 | 10.7 |
200 | 14.6 | 20.8 |
400 | 26.9 | 38.1 |
800 | 46.7 | 66.3 |
1450 | 70.2 | 89.3 |
Receive (don't know why MSS 200 & 400 started working and 1200 no longer works):
MSS vs Mbps | RNDIS | NCM |
---|---|---|
100 | 0 | 0 |
200 | 8.42 | 6.93 |
400 | 16.8 | 15.1 |
800 | 23.4 | 36.1 |
1450 | 0 | 0 |
For a similar Flash & RAM footprint, NCM's overall performance is better.
Interesting... when I implemented the driver, I did a lot of performance testing. And honestly, a lot of it was also guesswork. One of the assumptions was that having at least two buffers each for RX and TX would be beneficial: while one frame is being sent, the other can be filled (TX), and while a received frame is being handled, the other can concurrently receive new data (RX).
And I'm almost sure, that on my relatively slow RP2040 that was the case: I'm getting there around 5MBit/s in both directions, so the link is at 50% usage.
For your case with hi-speed, it probably doesn't matter because transfer is not the bottleneck.
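To illustrate the double-buffering idea (a generic ping-pong sketch, not the actual driver code; all names here are made up):

```c
/* TX ping-pong: fill one NTB while the other is in flight (illustrative only). */
#include <stdbool.h>
#include <stdint.h>

#define NTB_SIZE 3200

typedef struct {
  uint8_t  data[NTB_SIZE];
  uint16_t len;
  bool     in_flight;        /* currently owned by the USB hardware */
} ntb_t;

static ntb_t    tx_ntb[2];
static unsigned tx_fill_idx; /* NTB currently being filled by the network stack */

/* Network glue asks for a buffer to copy the next datagram into;
 * returns NULL if both NTBs are still in flight (caller retries later). */
static ntb_t *tx_acquire(void) {
  ntb_t *ntb = &tx_ntb[tx_fill_idx];
  return ntb->in_flight ? NULL : ntb;
}

/* Hand the filled NTB to the USB layer and switch to the other buffer,
 * so the stack can keep producing while this one is being transmitted. */
static void tx_submit(ntb_t *ntb) {
  ntb->in_flight = true;
  tx_fill_idx ^= 1u;
}

/* Transfer-complete callback: the buffer is free again. */
static void tx_complete(ntb_t *ntb) {
  ntb->in_flight = false;
  ntb->len = 0;
}
```

With only one NTB per direction, tx_acquire() would return NULL for the whole duration of each USB transfer, which is presumably why two buffers help on a slower Full-Speed link.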
Nevertheless, I have the feeling we are hunting some performance issues (which are optimizations and perhaps further to investigate).
Any real reason to not merge the driver?
> And I'm almost sure, that on my relatively slow RP2040 that was the case: I'm getting there around 5MBit/s in both directions, so the link is at 50% usage.
Could you test whether the current settings work well in your case?
Edit: I saw some improvement from increasing the NTB count with Full-Speed on an STM32F407.
> Nevertheless, I have the feeling we are hunting some performance issues (which are optimizations and perhaps further to investigate). Any real reason to not merge the driver?
Well, the main reason is lack of resources; even my own PRs are awaiting review.
I'd like to make sure everything works smoothly, since we have often run into cases where the original authors are hard to contact when something happens, leaving pending issues that are hard to fix.
We are close; once we choose a parameter set I'll do a formatting pass and then it's ready.
I understand the situation & I'm actually one of those :-/
My last (side) project was based on RP2040 and its SDK which uses TinyUSB et al.
My current (professional) project is based on Zephyr, so chances are that I will drift away from TinyUSB.
Concerning the settings: of course I will test those. Give me some minutes.
Gosh... the number of screws to turn tends to drive me crazy: on my device, setting CFG_TUD_NCM_*_NTB_MAX_SIZE=2048 makes iperf produce retries. This can be circumvented by setting wNtbOutMaxDatagrams=1, but that is actually not the intention. So the 3200 was not randomly selected. But don't ask me why the 2048 length produces errors.
So here are my results (number=buffer-size, n=x means number of buffers):
MSS vs Mbps | org | 2048/n=1 | 3200/n=1 |
---|---|---|---|
100 | 2.6 | 1.7 | 1.7 |
200 | 3.5 | 2.6 | 2.6 |
400 | 4.1 | 0.5 | 3.4 |
800 | 4.5 | 0.6 | 4.1 |
1450 | 4.7 | 4.5 | 4.5 |
So the current configuration isn't there for no reason, at least with my configuration. I've also done intensive tests with SystemView, and the outcome was the configuration you can see in the repo.
Perhaps a parameter for wNtbOutMaxDatagrams should be added, to allow regular users to enter parameter hell as well!?
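Something along these lines could work (the macro name is purely illustrative and would have to match whatever the driver actually uses; only wNtbOutMaxDatagrams itself comes from the NCM NTB parameter structure):

```c
/* Hypothetical override -- illustrative name, not an existing TinyUSB define. */
#ifndef CFG_TUD_NCM_OUT_MAX_DATAGRAMS_PER_NTB
  /* 0 = "no limit" per the NCM spec; 1 trades throughput for compatibility
   * with setups where multi-datagram OUT NTBs cause retries. */
  #define CFG_TUD_NCM_OUT_MAX_DATAGRAMS_PER_NTB  0
#endif
```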
Fighting with the parameters and trying to understand... here's the current outcome:
```c
#ifndef CFG_TUD_NCM_IN_NTB_MAX_SIZE
/// must be >> MTU
#define CFG_TUD_NCM_IN_NTB_MAX_SIZE (2 * TCP_MSS + 100) // can be set to 2048 without impact
#endif

#ifndef CFG_TUD_NCM_OUT_NTB_MAX_SIZE
/// must be >> MTU
#define CFG_TUD_NCM_OUT_NTB_MAX_SIZE (2 * TCP_MSS + 100) // can be set to smaller values if wNtbOutMaxDatagrams==1
#endif
```
Meaning: CFG_TUD_NCM_OUT_NTB_MAX_SIZE is dependent on TCP_MSS while CFG_TUD_NCM_IN_NTB_MAX_SIZE is not. CFG_TUD_NET_MTU=1514 in my case.
For me it is not clear why CFG_TUD_NCM_OUT_NTB_MAX_SIZE must be twice TCP_MSS (plus some overhead). Is it a host driver issue or what?
Too many TCP parameters...
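For what it's worth, a quick size calculation with the NTH16/NDP16 sizes from the NCM 1.0 spec (16-bit NTBs, a 1514-byte MTU and TCP_MSS = 1460 assumed) hints at why the buffer sizes behave so differently:

```c
/* Per 16-bit NTB (header sizes per NCM 1.0 spec):
 *   NTH16 header                                : 12 bytes
 *   NDP16 (8-byte header + 4 bytes per datagram
 *          entry + 4-byte null terminator)      : 16 bytes for 1 datagram,
 *                                                 20 bytes for 2 datagrams
 *
 * With a 1514-byte max Ethernet frame (CFG_TUD_NET_MTU):
 *   1 datagram : 1514 + 12 + 16   = 1542  -> fits in 1600 or 2048; the rest
 *                                            of a 2048 NTB is wasted
 *   2 datagrams: 2*1514 + 12 + 20 = 3060  -> does NOT fit in 2048, fits in 3200
 *
 * 2 * TCP_MSS + 100 = 3020 lands just under two max-size frames, so it
 * presumably carries one full segment plus a shorter one per OUT NTB.
 * (Alignment padding between datagrams adds a few bytes; ignored here.)
 */
```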
> Perhaps a parameter for wNtbOutMaxDatagrams should be added, to allow regular users to enter parameter hell as well!?
Yes why not.
> So the current configuration isn't there for no reason, at least with my configuration. I've also done intensive tests with SystemView, and the outcome was the configuration you can see in the repo.
If you agree, we can set the NTB default to (2 * TCP_MSS + 100) and N=1, with a comment mentioning that increasing N may improve performance at the cost of RAM usage in some cases.
> But don't ask me why the 2048 length produces errors.
I saw this behavior on FS too; maybe a driver thing.
I have added your suggestions. Currently the NCM driver is not built in the CI. Any idea how to add that?
I'll do the comment and format things :) Also I have to move `#include "lwipopts.h"` to the example's tusb_config.h; the stack itself should have minimal external dependencies.
> I'll do the comment and format things :) Also I have to move `#include "lwipopts.h"` to the example's tusb_config.h; the stack itself should have minimal external dependencies.
Thank you very much for testing and reviewing.
Thank you for your great effort on this new driver! Now it's set up.
> Thank you for your great effort on this new driver! Now it's set up.
Wonderful to hear that my contribution will be part of tinyusb. And many thanks for your review. It was very constructive and helpful for rechecking some parts.