nuttx [HELP] STM32H7 hardfault in lpwork

Description

Hello everyone,

We are using an STM32H7-based board and recently updated NuttX from 9.1.0 to the 12.6.0-RC1 tag. We use unbuffered TCP/IP networking to send data from the STM32H7 MCU to our main controller. When there is a lot of traffic between the two, we see long delays in replies and, after a while, the app crashes. We did not have this issue on NuttX 9.1.0. When we switch NuttX 12.6.0 to buffered networking (CONFIG_NET_TCP_WRITE_BUFFERS), the network behaves better: delays in replies are less frequent and the app does not crash.

What could be the cause of a hardfault in the lpwork when using unbuffered networking?

The network driver buffer configuration is CONFIG_NET_RECV_BUFSIZE = 32768 (32 KiB).

The stack dump on hard fault with unbuffered networking looks like this (flash addresses converted with arm-none-eabi-addr2line):

0x0809b2bf tcp_recvhandler /Nuttx/net/tcp/tcp_recvfrom.c:500

0x080263a7 devif_conn_event /Nuttx/net/devif/devif_callback.c:521

0x08025c87 tcp_callback /Nuttx/net/tcp/tcp_callback.c:308

0x08027319 tcp_input /Nuttx/net/tcp/tcp_input.c:1547

0x08026405 ipv4_in /Nuttx/net/devif/ipv4_input.c:149

0x080264bf ipv4_in /Nuttx/net/devif/ipv4_input.c:153

0x080269cf netdev_input /Nuttx/net/netdev/netdev_input.c:91

0x08021dbd stm32_receive /Nuttx/arch/arm/src/chip/stm32_ethernet.c:1918

0x08022fe5 up_irq_save /Nuttx/include/arch/armv7-m/irq.h:416

0x08023a71 nxtask_start /Nuttx/sched/task/task_start.c:122
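For anyone who wants to resolve a similar dump, here is a minimal sketch of the address conversion. It is not the exact procedure used for the trace above: the helper names, the ELF path `nuttx`, and the flash range (0x08000000, 2 MiB) are assumptions to adjust for your build and chip. It only relies on the standard `-e`, `-f` and `-p` options of addr2line.

```python
import re
import subprocess

# Matches 32-bit hex words such as 0x0809b2bf in a crash dump.
ADDR_RE = re.compile(r"0x[0-9a-fA-F]{8}")


def flash_addresses(dump, start=0x08000000, size=0x00200000):
    """Return the hex words in `dump` that fall inside the flash range.

    The default range is an assumption (flash at 0x08000000, 2 MiB);
    check the memory map of your part. Order is kept, duplicates dropped.
    """
    found = []
    for word in ADDR_RE.findall(dump):
        value = int(word, 16)
        if start <= value < start + size and word not in found:
            found.append(word)
    return found


def resolve(addresses, elf="nuttx", tool="arm-none-eabi-addr2line"):
    """Run addr2line on the addresses and return one output line per address.

    `elf` must be the unstripped image that was flashed, otherwise the
    function names and line numbers will not match.
    """
    cmd = [tool, "-e", elf, "-f", "-p", *addresses]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Typical use would be `resolve(flash_addresses(dump_text))`, where `dump_text` is the stack dump copied from the console. Values on the stack that merely look like code addresses are resolved too, so not every line of the result is a real call frame.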

Verification

[X] I have verified before submitting the report.

Sep 25 '24 18:09 vladsomai

Hi @vladsomai, is it possible to reproduce this issue on a common STM32 board with Ethernet, like nucleo-f746, etc.? Is it possible to reproduce it using an existing network test application from nuttx-apps?

If you can reproduce it, please submit a board config that we could use for testing.

If you can only reproduce it on your board, I suggest this test to discover when the issue was introduced: grab a release between 9.1 and 12.6 and copy your boards/arm/stm32h7/boardname there (and include the 3 entries in boards/Kconfig). Repeat this search until you find the release in which the issue was introduced, then do a quick git bisect to find the commit that introduced it.
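The release search described above is a binary search, so it needs only a handful of builds. A minimal sketch follows, assuming the oldest release in the list is known good, the newest is known bad, and the fault does not disappear again once introduced. The function name `first_bad` and the `is_bad` callback are hypothetical; the callback stands for the manual step of building a release, flashing it, and running the stress test.

```python
def first_bad(versions, is_bad):
    """Return the first entry of `versions` for which is_bad() is True.

    Assumes versions[0] is known good, versions[-1] is known bad, and
    that every version after the first bad one is bad as well.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid    # fault already present: look at older releases
        else:
            good = mid   # still fine: look at newer releases
    return versions[bad]
```

Once the first bad release is known, `git bisect start`, `git bisect bad <first bad tag>` and `git bisect good <last good tag>` narrow it down to a single commit in the same way.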

@wengzhe do you have some idea about this issue?

Sep 26 '24 18:09 acassis

@wengzhe do you have some idea about this issue?

@acassis I'll try to take a look. We haven't tried unbuffered TCP under heavy traffic before, because we always use the buffered one when we have a lot of data to send.

Sep 30 '24 07:09 wengzhe

Hello @acassis, I just got a nucleo-h743zi board and I will come back soon with a config if I can reproduce it. Trying multiple NuttX versions is time-consuming because migrating our app to a new version may take a couple of days.

@wengzhe thank you for replying on this thread. We tested the buffered send, but we see a lot of "Spurious Retransmission" messages in the Wireshark trace, usually when the packets are larger than CONFIG_NET_ETH_PKTSIZE. These retransmissions hurt throughput badly.

So, to summarize, these are the current issues we noticed when stressing the network:

Buffered send has Spurious Retransmissions, affecting network throughput when having high traffic with packets greater than NET_ETH_PKTSIZE.
Unbuffered send does not have Spurious Retransmissions when the STM32H7 based board sends the packets, but the app crashes (with the above stack trace) when a message is sent from the main controller to the STM32H7 based board while traffic comes from the STM32H7 board.

Q1: Did you ever encounter Spurious Retransmissions when using buffered tcp?

Q2: Which board config are you using to test the buffered and unbuffered tcp? We would like to take a look in the config you work with and test the most.

Sep 30 '24 11:09 vladsomai

Q1: Did you ever encounter Spurious Retransmissions when using buffered TCP?

If you encounter a "Spurious Retransmission", the previous ACK may have been dropped or delayed somewhere before reaching tcp_input, which likely points to a driver or checksum issue.

Q2: Which board config are you using to test the buffered and unbuffered tcp? We would like to take a look in the config you work with and test the most.

We don't always use the community configs (normally we run NuttX on our own products with the corresponding drivers), but we have been running some tests on esp32c3-devkit:wifi lately (buffered only), or maybe sim:tcpblaster/qemu-armv8a:netnsh, which are independent of any hardware.

Sep 30 '24 15:09 wengzhe

@wengzhe thanks for your input. I tested the tcpblaster app on our board and on the Nucleo to compare the two. The Nucleo board reached max throughput in both client and server scenarios, while our app had a lot of spurious retransmissions and dup ACK errors. I investigated the difference between our board and the Nucleo and found that the PHY was configured incorrectly. We use the same software on different boards that have the KSZ8081 and KSZ8895, which should be configured differently. We now manage to get 10 Mb/s consistently in tcpblaster and there are no spurious retransmissions anymore.

Using the Nucleo board, I made a comparison between buffered and unbuffered send. Here are the results using tcpblaster:

Unbuffered: 100-136 Kb/s with some spikes to 200 Kb/s
Buffered: 10 Mb/s consistently

Based on this test we decided to only use the buffered send from now on.
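To put those numbers side by side, here is a quick calculation of the gap. It assumes decimal units (1 Mb/s = 1000 Kb/s) and uses only the figures quoted above; the variable names are just for illustration.

```python
# tcpblaster figures quoted above, converted to Kb/s (1 Mb/s = 1000 Kb/s assumed).
buffered_kbps = 10 * 1000
unbuffered_low_kbps = 100
unbuffered_high_kbps = 136
unbuffered_spike_kbps = 200

# How many times faster the buffered send is in each case.
ratio_vs_low = buffered_kbps / unbuffered_low_kbps
ratio_vs_high = buffered_kbps / unbuffered_high_kbps
ratio_vs_spike = buffered_kbps / unbuffered_spike_kbps

print(f"vs 100 Kb/s: {ratio_vs_low:.1f}x")
print(f"vs 136 Kb/s: {ratio_vs_high:.1f}x")
print(f"vs 200 Kb/s spike: {ratio_vs_spike:.1f}x")
```

So in this test buffered send was roughly 74 to 100 times faster than the typical unbuffered rate, and still about 50 times faster than the unbuffered spikes.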

I will mark this thread as closed. Cheers!

Oct 07 '24 18:10 vladsomai