ArduinoCore-renesas

Network Related Crash on Long Running MQTT connections

Open Spinnaker-design opened this issue 1 year ago • 6 comments

I am seeing a crash on the Portenta C33 after an MQTT client has been connected for a long time (~15 minutes). The crash happens while the sketch is in a delay() call, inside the lwip_task of CNetIf.cpp. It looks like a memory management issue in the networking code.

We are using an SSL Client and certificates for our server authentication.

Spinnaker-design avatar Feb 14 '24 17:02 Spinnaker-design

Here is the call stack for the crash:

_free_r@0x00060a0a (/_free_r.dbgasm:51)
__gnu_cxx::new_allocator<CMsg>::deallocate@0x0005ae5e (/Users/kylevisner/.platformio/packages/[email protected]/arm-none-eabi/include/c++/7.2.1/ext/new_allocator.h:125)
std::allocator_traits<std::allocator<CMsg> >::deallocate@0x0005ae5e (/Users/kylevisner/.platformio/packages/[email protected]/arm-none-eabi/include/c++/7.2.1/bits/alloc_traits.h:462)
std::_Deque_base<CMsg, std::allocator<CMsg> >::_M_deallocate_node@0x0005ae5e (/Users/kylevisner/.platformio/packages/[email protected]/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:609)
std::_Deque_base<CMsg, std::allocator<CMsg> >::_M_destroy_nodes@0x0005ae5e (/Users/kylevisner/.platformio/packages/[email protected]/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:743)
std::_Deque_base<CMsg, std::allocator<CMsg> >::~_Deque_base@0x0005ae74 (/Users/kylevisner/.platformio/packages/[email protected]/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:665)
std::deque<CMsg, std::allocator<CMsg> >::~deque@0x0005b1c4 (/Users/kylevisner/.platformio/packages/[email protected]/arm-none-eabi/include/c++/7.2.1/bits/stl_deque.h:1045)
std::queue<CMsg, std::deque<CMsg, std::allocator<CMsg> > >::~queue@0x0005b1c4 (/Users/kylevisner/.platformio/packages/[email protected]/arm-none-eabi/include/c++/7.2.1/bits/stl_queue.h:96)
CEspCom::clearToEspQueue@0x0005b1c4 (/CEspCom::clearToEspQueue.dbgasm:109)
esp_host_there_are_data_to_be_tx@0x0005a6e4 (/esp_host_there_are_data_to_be_tx.dbgasm:12)
esp_host_spi_transaction@0x0005a6f8 (/esp_host_spi_transaction.dbgasm:5)
esp_host_perform_spi_communication@0x0005a73e (/esp_host_perform_spi_communication.dbgasm:7)
CEspControl::communicateWithEsp@0x00058ed8 (/CEspControl::communicateWithEsp.dbgasm:10)
CLwipIf::lwip_task@0x0004c0a8 (/CLwipIf::lwip_task.dbgasm:30)
CLwipIf::timer_cb@0x0004c10a (/CLwipIf::timer_cb.dbgasm:4)
r_gpt_call_callback@0x0002e174 (Unknown Source:1719)
<signal handler called>@0xffffffe9 (Unknown Source:0)
bsp_prv_software_delay_loop@0x0002f864 (/bsp_prv_software_delay_loop.dbgasm:1)
delay@0x00023c0a (/delay.dbgasm:4)
SSLClient::read@0x0001f628 (/SSLClient::read.dbgasm:8)
SSLClient::connected@0x0001f5b8 (/SSLClient::connected.dbgasm:10)

Spinnaker-design avatar Feb 14 '24 19:02 Spinnaker-design

Thanks for your report. I got the same error while working on https://github.com/arduino/ArduinoCore-renesas/pull/234. In that PR I am trying to deal with all the network-related issues, for the time being on Ethernet and WiFi. I will try to address this issue in that PR.

andreagilardoni avatar Feb 15 '24 14:02 andreagilardoni

Thanks, @andreagilardoni. Is there a workaround in the meantime to unblock us until that PR is done?

Spinnaker-design avatar Feb 15 '24 16:02 Spinnaker-design

You can try using my PR and disabling the timer inside the network stack:

  • Take the example here as a reference.
  • Comment out this line.
  • Call CLwipIf::getInstance().task() inside the loop() function.
  • Design your application to avoid blocking calls as much as possible.
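
As a rough illustration of the last two points, here is a minimal host-side sketch of a loop that services the network stack on every pass and schedules periodic work without delay(). FakeNetStack, IntervalTimer and App are hypothetical names invented for this example; FakeNetStack only stands in for the CLwipIf::getInstance().task() call mentioned above, and the time value would come from millis() on the board.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the network stack singleton. On the board the
// call would be CLwipIf::getInstance().task(); this mock only counts polls.
struct FakeNetStack {
    unsigned polls = 0;
    void task() { ++polls; }
};

// Hypothetical helper: reports "due" at most once per interval, never blocks.
class IntervalTimer {
public:
    explicit IntervalTimer(uint32_t interval_ms) : interval_ms_(interval_ms) {}

    bool due(uint32_t now_ms) {
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (now_ms - last_ms_ >= interval_ms_) {
            last_ms_ = now_ms;
            return true;
        }
        return false;
    }

private:
    uint32_t interval_ms_;
    uint32_t last_ms_ = 0;
};

struct App {
    FakeNetStack net;
    IntervalTimer publish_timer{1000};  // assumed 1 s publish period
    unsigned publishes = 0;

    // One pass of loop(): always service the stack, then do periodic work
    // only when it is due, so the stack is never starved by a blocking wait.
    void loop_once(uint32_t now_ms) {
        net.task();
        if (publish_timer.due(now_ms)) {
            ++publishes;  // e.g. publish an MQTT message here
        }
    }
};
```

In a real sketch, loop() would call something like app.loop_once(millis()). The point of the pattern is that the stack is polled from the main context instead of from a timer interrupt, so any blocking call in loop() delays networking rather than being interrupted by it.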

Any kind of feedback on this work is appreciated.

andreagilardoni avatar Feb 15 '24 20:02 andreagilardoni

@andreagilardoni I was able to build with your PR, but I ran into two problems with the workaround:

  • If you comment out line 30 of CNetIf.h, you get a build error.
  • If you build with CLwipIf::getInstance().task(), you get the following error:

Compilation error: 'class CLwipIf' has no member named 'task'

Spinnaker-design avatar Feb 20 '24 21:02 Spinnaker-design

Well, after many weeks of wireless networking problems on the C33 platform, it looks like no fix is coming anytime soon. On our system we even "disable" networking after power-on (following a brief use to access NTP), but the networking still causes a system hang after many hours of running (rare but fatal). It appears that the class destructors are not cleaning up correctly, since fragments of "WiFi" functionality are left operating after disconnection/shutdown. I think the advertisements for the Arduino C33 should NOT list networking, since it doesn't work correctly yet.

zsnave avatar Jun 19 '24 19:06 zsnave

Hello, have you found a way to fix this issue? It is very annoying.

Jérémy

jeremypy972 avatar Nov 21 '24 20:11 jeremypy972