async-mqtt-client icon indicating copy to clipboard operation
async-mqtt-client copied to clipboard

Various crashes on publish

Open rousveiga opened this issue 3 years ago • 14 comments

Hello. I have been experienced crashes on publish. My sketch publishes on three topics every second, and the crashes always happen on the second or third publish invocation for each round. Here are my decoded stack traces:

Third invocation

0x4008d5a4: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 155
0x4008d7d5: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 170
0x4015964f: __cxxabiv1::__terminate(void (*)()) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc line 47
0x40159696: std::terminate() at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc line 57
0x401592f1: __cxxabiv1::__cxa_allocate_exception(std::size_t) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_alloc.cc line 268
0x40159074: operator new(unsigned int) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc line 54
0x400eb6bf: AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at C:\Users\Rosa\Documents\Arduino\libraries\async-mqtt-client\src\AsyncMqttClient.cpp line 737

Second invocation

0x4008d5a4: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 155
0x4008d7d5: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 170
0x4015964f: __cxxabiv1::__terminate(void (*)()) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc line 47
0x40159696: std::terminate() at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc line 57
0x401592f1: __cxxabiv1::__cxa_allocate_exception(std::size_t) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_alloc.cc line 268
0x40159074: operator new(unsigned int) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc line 54
0x400ed9ed: AsyncMqttClientInternals::PublishOutPacket::PublishOutPacket(char const*, unsigned char, bool, char const*, unsigned int) at c:\users\rosa\appdata\local\arduino15\packages\esp32\tools\xtensa-esp32-elf-gcc\1.22.0-80-g6c4433a-5.2.0\xtensa-esp32-elf\include\c++\5.2.0\ext/new_allocator.h line 104
0x400eb6d1: AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at C:\Users\Rosa\Documents\Arduino\libraries\async-mqtt-client\src\AsyncMqttClient.cpp line 737
0x4008d5a4: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 155
0x4008d7d5: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 170
0x4015964f: __cxxabiv1::__terminate(void (*)()) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc line 47
0x40159696: std::terminate() at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc line 57
0x401592f1: __cxxabiv1::__cxa_allocate_exception(std::size_t) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_alloc.cc line 268
0x40159074: operator new(unsigned int) at /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc line 54
0x400eb6bf: AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at C:\Users\Rosa\Documents\Arduino\libraries\async-mqtt-client\src\AsyncMqttClient.cpp line 737

If there is any other info that I could post to help diagnose the issue, please tell me and I'll try to provide it.

I also get a Guru Meditation Error: Core 0 panic'ed (StoreProhibited), but the stack trace points to AsyncTCP as a the source - should I post the issue here or there?

Thanks in advance.

rousveiga avatar Jun 01 '21 09:06 rousveiga

Is it reproducable with generic code? Are you using the master or develop branch?

I can look into the issue next week.

bertmelis avatar Jun 03 '21 07:06 bertmelis

I'm using the develop branch, latest commit.

To be honest, I haven't tried to reproduce it with generic code - I will try to put together a MWE.

rousveiga avatar Jun 03 '21 07:06 rousveiga

Sorry for the delay, I have a MWE now: https://pastebin.com/ug5Hs8mS. Here are the logs: https://drive.google.com/file/d/16gM9VgLNqQPIPxVR_fOmROubKuvtcotx/view?usp=sharing

The logfile is pretty big, so here's the interesting part: https://pastebin.com/sFHesgC5

This example also reproduced another issue I had been having with reconnection to the broker, so I'll be opening a new issue for that.

rousveiga avatar Jun 25 '21 08:06 rousveiga

The crash reason is not in the log... the "Guru meditation error..." contains why the code crashed. It could be the watchdog, a dangling pointer... Is there a way to get the error code?

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/fatal-errors.html

bertmelis avatar Jul 01 '21 11:07 bertmelis

The Guru Meditation error I get does show up in the full log. I don't have access to that file at the moment but I'll paste it here as soon as I can.

It happens at a different point in time (IIRC it does happen at a publish invocation but not inmediately after or before the specific error instances I pasted in the first post), so I was not sure it was related to this issue.

rousveiga avatar Jul 03 '21 22:07 rousveiga

Most obvious reason is an invalid pointer dereference. Mqtt messages are stored on the heap and cleaned up at regular intervals. Maybe (despite I tested a lot) there is an issue here.

Could also be the watchdog but this is more a esp8266 thing.

bertmelis avatar Jul 05 '21 05:07 bertmelis

Here's the Guru Meditation Error I get in the MWE:

loop: wifi connected
loop: mqtt connected
PUBLISH
new back #3
PUBLISH
new back #3
PUBLISH
new back #3
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x400d70c7  PS      : 0x00060530  A0      : 0x8011c01d  A1      : 0x3ffb3e90  
A2      : 0x3ffc117c  A3      : 0x3ffd00c0  A4      : 0x00000005  A5      : 0x00000020  
A6      : 0x00000020  A7      : 0x00000000  A8      : 0x00000004  A9      : 0x3ffb3e70  
A10     : 0x00000000  A11     : 0x40085750  A12     : 0x3ffb99e8  A13     : 0x00000000  
A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x00000019  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  

Backtrace: 0x400d70c7:0x3ffb3e90 0x4011c01a:0x3ffb3ec0 0x4011c4e6:0x3ffb3ee0 0x4011d00f:0x3ffb3f00 0x4011cef1:0x3ffb3f20 0x4011d0ac:0x3ffb3f40 0x401173b0:0x3ffb3f60 0x400893c5:0x3ffb3f90

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)

Decoded backtrace is:

PC: 0x400d70c7: _tcp_poll(void*, tcp_pcb*) at C:\Users\Rosa\Documents\Arduino\libraries\AsyncTCP-master\src\AsyncTCP.cpp line 261
EXCVADDR: 0x00000000

Decoding stack results
0x400d70c7: _tcp_poll(void*, tcp_pcb*) at C:\Users\Rosa\Documents\Arduino\libraries\AsyncTCP-master\src\AsyncTCP.cpp line 261
0x4011c01a: tcp_slowtmr at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp.c line 1213
0x4011c4e6: tcp_tmr at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp.c line 164
0x4011d00f: tcpip_tcp_timer at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/timeouts.c line 142
0x4011cef1: sys_check_timeouts at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/timeouts.c line 342
0x4011d0ac: sys_timeouts_mbox_fetch at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/timeouts.c line 435
0x401173b0: tcpip_thread at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/api/tcpip.c line 111
0x400893c5: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 143

Exactly like the Guru Meditation Error I got on my original code, it points to AsyncTCP; should I repost the issue there?

According to the documentation you linked, and since EXCVADDR is zero, it does indeed look like a NULL dereference.

rousveiga avatar Jul 05 '21 07:07 rousveiga

I'm on holiday now so I can't look into this in detail. I obviously also cant test it myself at this moment.

It is odd that the stack trace does not contain any functions from this codebase. Unless meanwhile solved, I'll look into this in two weeks.

bertmelis avatar Jul 05 '21 14:07 bertmelis

I'm on holiday now so I can't look into this in detail. I obviously also cant test it myself at this moment.

Whenever you can and want. Have a nice holiday!

It is odd that the stack trace does not contain any functions from this codebase.

Yes, I thought so myself.

rousveiga avatar Jul 05 '21 14:07 rousveiga

Just to touch on this, I have also narrowed down crashes on my system to also be related to calling publish.

There is a similarity too in that new(unsigned int) is lurking within the exception decode, but the stack trace is NOT in the async MQTT library, instead the async httpsserver, so maybe it is an AsyncTCP issue. It very much does smell of a corrupted heap somewhere, but the common aspect does appear to be the actions of publish that may be causing that....

rljonesau avatar Dec 06 '21 08:12 rljonesau

Hi, I also had a lot of issues with this but when I plotted the free heap, it was getting empty over 10min time or so. So I figured the heap wasn't cleaned up in time. I added a delay between publishes of 100ms (could be less but did not test it) or so and problem seems to be solved.

diskgokey avatar Apr 07 '22 12:04 diskgokey

same here, I put all publishes in a queue (size 300 messages) and use a delay of 100ms. Works fine.

proddy avatar Apr 07 '22 12:04 proddy

May I find this a but odd. The mqtt lib already puts the messages in a queue so there should not be adifference.

If it really is a heap (fragmentation?) issue, you could try to raise the minimum free heap requirement in this lib. There is a #define somewhere.

bertmelis avatar Apr 08 '22 07:04 bertmelis

https://github.com/marvinroger/async-mqtt-client/blob/bf341978ad506a78a3d56c560ba34c1d08bb4794/src/AsyncMqttClient.hpp#L9

diskgokey avatar Jun 10 '22 13:06 diskgokey