tinyusb icon indicating copy to clipboard operation
tinyusb copied to clipboard

Rewrite of NCM device driver

Open rgrr opened this issue 1 year ago • 16 comments

Describe the PR This is a rewrite of the original NCM device driver. You can check tests concerning functionality and performance here.

Additional context Buggyness of the current driver can be easily verified with for MSS in 100 200 400 800 1200 1450 1500; do iperf -c 192.168.14.1 -e -i 1 -M $MSS -l 8192 -P 1; sleep 2; done.

I'm using the driver in my own project. The USB device is running together with Linux/Win10 without any problems.

This solves #2068.

rgrr avatar Aug 21 '23 07:08 rgrr

Hi, sorry for huge delay. I'm not an expert of NCM class but I'd like to move the things on.

Compared to RNDIS current NCM implementation is problematic, and new implementation doesn't perform well at high MSS on Linux, with STM32F723:

MSS vs Mbit/s RNDIS NCM with #2564 NCM New
100 7.25 7.04 10.2
200 14.5 6.95 12.5
400 21.4 6.75 6.71
800 34.8 6.97 8.02
1450 34.8 8.17 8.00

On the bus I observed high delay between OUT and IN transfer, also between 2 OUT transfers for both NCM and NCM New. It could also be a linux tweaking issue, on Windows 10 I got a throughput of ~30Mbps without these high delay. image

I'll try to review your implementation, below is the USB capture in case if you're interested. ncm_new.zip

HiFiPhile avatar Apr 28 '24 17:04 HiFiPhile

On the bus I observed high delay between OUT and IN transfer, also between 2 OUT transfers for both NCM and NCM New. It could also be a linux tweaking issue, on Windows 10 I got a throughput of ~30Mbps without these high delay

Hello HiFiPhile

very interesting observation. Could you do the little benchmark comparison also with Win10.

rgrr avatar Apr 29 '24 07:04 rgrr

very interesting observation. Could you do the little benchmark comparison also with Win10.

It seems like iperf on Windows doesn't support MSS adjustment, I see no difference in captured packet.

On Windows there is only one TCP ACK packet each time instead of 2, also delay between transfer is reduced. I got a throughput of 24Mbps. image

Here is the USB capture : ncm_enum win.zip


One thing strange is on USB bus ACK seq number is shifted from data packet, or maybe it's a decoding issue.

image image

Here is the Netif capture on Linux: netif.zip

HiFiPhile avatar Apr 29 '24 18:04 HiFiPhile

Hmmm... don't know what to do here. Unfortunately I don't have any Hi-Speed device in my drawer, so incapable of checking the above.

At least I have a project using NCM which also works in conjunction with Win10. Speed is comparable. My major concern was compatibility.

rgrr avatar May 01 '24 04:05 rgrr

I was able to massively boost the performance simply by increasing TCP_WND !

MSS vs Mbps default x2 x4
100 10.2 10.3 10.4
200 12.5 18.9 19.2
400 6.71 24.9 32
800 8.02 12.5 46.6
1450 8.00 14.8 70.4

MSS=1450 comparaison

  • TCP_WND = TCP_MSS

Instead of one complete datagram 2 segments are sent, both are acked, then high delay (~1ms) until next transfer. image

  • TCP_WND = 2*TCP_MSS

Two complete datagrams are sent, both are acked, then high delay (~1ms) until next transfer. image

  • TCP_WND = 4*TCP_MSS

Two complete datagrams are sent, only 1 ack, ack delay is a little bit higher maybe for processing, no delay until next transfer. image


It makes me to believe there must be some relationships between TCP window and NTB size to make it works better, so I decreased NTB size to 1600 while keeping TCP_WND = 2*TCP_MSS:

MSS vs Mbps default x2 x4 x2; NTB 1600
100 10.2 10.3 10.4 10.7
200 12.5 18.9 19.2 20.6
400 6.71 24.9 32 38.4
800 8.02 12.5 46.6 44.8
1450 8.00 14.8 70.4 68

2 Packets are sent and 2 ack, no delay until next transfer. image

Below are all USB captures: tcp_wnd_test.zip


Bonus: With #2576 applied, I got 116Mbps @ MSS=1450, TCP_WND = 4*TCP_MSS !

HiFiPhile avatar May 01 '24 11:05 HiFiPhile

Wow... those are great numbers. Out of curiosity: did you test different TCP_WND with RNDIS as well?

And the other point: anything I can do on the driver side to enhance throughput?

rgrr avatar May 01 '24 17:05 rgrr

Out of curiosity: did you test different TCP_WND with RNDIS as well?

Here is RNDIS on Arch Linux:

MSS vs Mbps default x2 x4
100 7.48 7.54 7.74
200 14.7 14.1 14.7
400 25.6 27.8 27.2
800 38.8 41.8 50.9
1450 38.8 65.4 71.9

At high MSS throughput is little higher than my previous test... so TCP_WND helps RNDIS as well. On Windows 10 there isn't much difference, always at ~11Mbps @ 1450.

And the other point: anything I can do on the driver side to enhance throughput?

With your understanding is my finding about TCP_WND vs NTB size coherent ? How did you chose the size of 3200 ?

It looks like TCP_WND=2*TCP_MSS and NTB_SIZE=1600 has a good performance and balanced RAM usage, does it works well in your case ?

HiFiPhile avatar May 01 '24 18:05 HiFiPhile

Meaning, that RNDIS benefits only for MSS=800/1450 (and is still faster than NCM for this cases :-/).

Buffer size: I have taken the 3200 from the original TinyUSB NCM driver. So no thoughts from my side concerning buffer size.

rgrr avatar May 01 '24 19:05 rgrr

Hmmm... currently checking Zephyr sources. They do not transmit frames from device -> host which are > NET_ETH_MAX_FRAME_SIZE, which is 1500+header_size

rgrr avatar May 01 '24 19:05 rgrr

Not sure if lwip's iperf server is buggy or something else, reverse throughput doesn't work well, most cases it just doesn't transfer. Also looks like a it always use TCP_MSS as MSS.

MSS vs Mbps RNDIS NCM
100 0 0
200 0 0
400 0 0
800 23 16
1200 23 16
1450 0 0

NTB is not used efficiently, always cut in half. image

HiFiPhile avatar May 02 '24 19:05 HiFiPhile

Yes, I had the feeling as well and as far as I remember I found a bug in iperf sources but was too busy/lazy to put the fix into their merge machinery.

You can give the attached file a try.

lwiperf.txt

Another point concerning buffer/frame sizes: I'm currently checking Zephyr sources. They do not transmit frames from device -> host which are > NET_ETH_MAX_FRAME_SIZE, which is 1500+header_size

rgrr avatar May 02 '24 20:05 rgrr

PS: Just found kernel use 2048 as minimal size: https://github.com/torvalds/linux/blob/f03359bca01bf4372cf2c118cd9a987a5951b1c8/include/uapi/linux/usb/cdc.h#L447

https://github.com/torvalds/linux/blob/f03359bca01bf4372cf2c118cd9a987a5951b1c8/drivers/net/usb/cdc_ncm.c#L161

HiFiPhile avatar May 02 '24 20:05 HiFiPhile

Ah... yes... one more thing: in function_ncm.c there is a definition of up/downlink speed. Never really tried values for tuning but perhaps the Linux NCM driver is smart and checks those values?

rgrr avatar May 02 '24 20:05 rgrr

PS: Just found kernel use 2048 as minimal size:...

Meaning CONFIG_CDC_NCM_XMT/RCV_NTB_MAX_SIZE > 2048 ?

rgrr avatar May 02 '24 20:05 rgrr

Never really tried values for tuning but perhaps the Linux NCM driver is smart and checks those values?

Already tried, I set it to 480Mbps and nothing changes.

HiFiPhile avatar May 02 '24 20:05 HiFiPhile

Meaning CONFIG_CDC_NCM_XMT/RCV_NTB_MAX_SIZE > 2048 ?

I think so, performance is 2-3% lower than 1600 as most parts of last packet is wasted.

HiFiPhile avatar May 02 '24 20:05 HiFiPhile

I got some major throughput regression updating LWIP to master, @ 800 MSS it went down to 37Mbps from 46 Mbps.

I tried your lwiperf but it's not better, instead of having 0bps iperf just hung.

Log shows:

lwiperf_tcp_poll
...
lwiperf_tcp_poll
lwiperf_tx_start_impl
lwiperf_tcp_client_connected
lwiperf_tcp_client_send_more a
lwiperf_tcp_client_send_more b
lwiperf_tcp_poll
...
lwiperf_tcp_poll
lwiperf_tcp_client_send_more end time

Revert LWIP to 2.1.2, another finding is reduce CFG_TUD_NCM_OUT_NTB_N and CFG_TUD_NCM_IN_NTB_N to 1 doesn't affect performance at all and save a lot of RAM.

To resume:

  • CFG_TUD_NCM_OUT_NTB_N=1
  • CFG_TUD_NCM_IN_NTB_N=1
  • CFG_TUD_NCM_IN_NTB_MAX_SIZE=2048
  • CFG_TUD_NCM_OUT_NTB_MAX_SIZE=2048
  • TCP_SND_BUF=(4 * TCP_MSS)
  • TCP_WND=(4 * TCP_MSS)
  • PBUF_POOL_SIZE=4

Transmit:

MSS vs Mbps RNDIS NCM
100 7.54 10.7
200 14.6 20.8
400 26.9 38.1
800 46.7 66.3
1450 70.2 89.3

Receive, don't know why MSS 200 & 400 start to work and 1200 no longer work:

MSS vs Mbps RNDIS NCM
100 0 0
200 8.42 6.93
400 16.8 15.1
800 23.4 36.1
1450 0 0

For a similar Flash & RAM footprint NCM's overall performance is better.

HiFiPhile avatar May 04 '24 14:05 HiFiPhile

Interesting... when I implemented the driver, I did a lot of performance testing. And honestly: a lot was also guess work. One of the assumptions was, that having at least two buffers for each RX/TX would be of benefit, because while one frame is sent, the other can be filled (TX) or while a received frame is being handled it should be a good idea to have one concurrently receiving new data.

And I'm almost sure, that on my relatively slow RP2040 that was the case: I'm getting there around 5MBit/s in both directions, so the link is at 50% usage.

For your case with hi-speed, it probably doesn't matter because transfer is not the bottleneck.

Nevertheless, I have the feeling we are hunting some performance issues (which are optimizations and perhaps further to investigate).

Any real reason to not merge the driver?

rgrr avatar May 04 '24 20:05 rgrr

And I'm almost sure, that on my relatively slow RP2040 that was the case: I'm getting there around 5MBit/s in both directions, so the link is at 50% usage.

Could you make test if current settings works well in your case ?

Edit: I saw some improvement increasing NTBs with Full-Speed on STM32F407.

Nevertheless, I have the feeling we are hunting some performance issues (which are optimizations and perhaps further to investigate). Any real reason to not merge the driver?

Well main reason is lacking of ressource, even my own PR's are missing from review.

I'd like to ensure if everything works smoothly, since we often met the case where original authors are hard to be contacted when something happens, there are pending issues hard to fix.

We are close, once we choose a parameter set I'll do a formating pass then it's ready.

HiFiPhile avatar May 04 '24 20:05 HiFiPhile

I understand the situation & I'm actually one of those :-/

My last (side) project was based on RP2040 and its SDK which uses TinyUSB et al.

My current (professional) project is based on Zephyr, so chances are that I will drift away from TinyUSB.

Concerning the settings: of course I will test those. Give me some minutes.

rgrr avatar May 04 '24 20:05 rgrr

Gosh... the number of screws to turn tends to drive me crazy: on my device, if setting CFG_TUD_NCM_*_NTB_MAX_SIZE=2048 then iperf produces retries. This can be circumvented by setting wNtbOutMaxDatagrams=1, but this is actually not the intention. So the 3200 are not randomly selected. But don't ask me, why the 2048 length produces errors.

So here are my results (number=buffer-size, n=x means number of buffers):

MSS vs Mbps org 2048/n=1 3200/n=1
100 2.6 1.7 1.7
200 3.5 2.6 2.6
400 4.1 0.5 3.4
800 4.5 0.6 4.1
1450 4.7 4.5 4.5

So the current configuration isn't there for no reason, at least with my configuration. I've done also intensive tests with SystemView and outcome was the configuration you can see in the repo.

rgrr avatar May 04 '24 21:05 rgrr

Perhaps a parameter for wNtbOutMaxDatagrams should be added to allow the regular user as well to enter parameter hell!?

rgrr avatar May 04 '24 22:05 rgrr

fighting with the parameters and trying to understand... that's the current outcome:

#ifndef CFG_TUD_NCM_IN_NTB_MAX_SIZE
    /// must be >> MTU
    #define CFG_TUD_NCM_IN_NTB_MAX_SIZE        (2 * TCP_MSS + 100)     // can be set to 2048 without impact
#endif
#ifndef CFG_TUD_NCM_OUT_NTB_MAX_SIZE
    /// must be >> MTU
    #define CFG_TUD_NCM_OUT_NTB_MAX_SIZE       (2 * TCP_MSS + 100)     // can be set to smaller values if wNtbOutMaxDatagrams==1
#endif

Meaning: CFG_TUD_NCM_OUT_NTB_MAX_SIZE is dependent on TCP_MSS while CFG_TUD_NCM_IN_NTB_MAX_SIZE is not. CFG_TUD_NET_MTU=1514 in my case.

For me it is not clear, why CFG_TUD_NCM_OUT_NTB_MAX_SIZE must be twice TCP_MSS (plus some overhead). Is it a host driver issue or what?

Too many TCP parameters...

rgrr avatar May 05 '24 06:05 rgrr

Perhaps a parameter for wNtbOutMaxDatagrams should be added to allow the regular user as well to enter parameter hell!?

Yes why not.

So the current configuration isn't there for no reason, at least with my configuration. I've done also intensive tests with SystemView and outcome was the configuration you can see in the repo.

If you agree we can set NTB default at (2 * TCP_MSS + 100) and N=1 with a comment mentioning increasing N may increase performance in trade of RAM usage in some cases.

But don't ask me, why the 2048 length produces errors.

Me too I saw this behavior on FS, maybe a driver thing.

HiFiPhile avatar May 05 '24 16:05 HiFiPhile

I have added your suggestions. Currently the NCM driver is not built in the CI. Any idea how to add that?

rgrr avatar May 05 '24 18:05 rgrr

I'll do comment and format things :) Also I have to move #include "lwipopts.h" to example's tusb_config.h, the stack itself should have minimal external dependencies.

HiFiPhile avatar May 05 '24 19:05 HiFiPhile

I'll do comment and format things :) Also I have to move #include "lwipopts.h" to example's tusb_config.h, the stack itself should have minimal external dependencies.

Thank you very much for testing and reviewing.

rgrr avatar May 05 '24 19:05 rgrr

Thank you for your great effort for this new driver ! Now it's setup.

HiFiPhile avatar May 06 '24 22:05 HiFiPhile

Thank you for your great effort for this new driver ! Now it's setup.

Wonderful to hear that my contribution will be part of tinyusb. And many thanks for your review. It was very constructive and helpful for rechecking some parts.

rgrr avatar May 06 '24 23:05 rgrr