OpenMQTTGateway Lilygo RTL433 runs fine, then reboots every couple of minutes then hangs...

I am running 2 different Lilygo rtl433 ESP32 devices. They had been running fine for the last couple of weeks. Then recently, the one downstairs, started the following behavior.

Run fine for 1 or more days
Start cycle of rebooting every 1-10 minutes (for about 10-20 times)
Freeze/hang (requiring physical reboot)

The upstairs device seems completely stable. Both devices never seem to get below about 2K free rtl_433 stack space nor do they run out of memory.

It's unlikely to be a power issue as I have a LiIon battery pack backup on the Lilygo device. More generally, I don't think it's the hardware or firmware per-se as swapping the devices caused the swapped device to start exhibiting the above behavior.

Any ideas on what could be causing this?

Why does instability arise only after a couple of days?
Why does it suddenly start repeatedly rebooting every couple of minutes?
Why does it suddenly freeze and not even reboot?
Why does it affect only the device in one location?

The only thing I can think of is that a new device or device reading comes online that is only received at the downstairs location -- and somehow that triggers a bug or even a stack overflow (before it gets reported to the MQTT broker)...

@1technophile @NorthernMan54 have you seen anything like this?

Oct 21 '24 16:10 puterboy

Need to see the serial output

Oct 21 '24 19:10 NorthernMan54

OK...

N: [ OMG->MQTT ] topic: home-crap/OMG_lilygo_rtl_433_ESP-2/RTL_433toMQTT/TFA-303221/1/219 msg: {"model":"TFA-303221","id":219,"channel":1,"battery_ok":1,"temperature_C":16.4,"humidity":65,"sendmode":0,"mic":"CRC","protocol":"TFA Dostmann 30.3221.02 T/H Outdoor Sensor","rssi":-94,"duration":206998}
T: isAdupl?
T: Enqueue JSON
T: Queue length: 1
T: Min ind: 6
T: store code : 21 / 381081797
T: Col: val/timestamp
T: mem code : 194 / 381053903
T: mem code : 191 / 381056961
T: mem code : 103 / 381057973
T: mem code : 30 / 381076050
T: mem code : 218 / 381080207
T: mem code : 235 / 381081571
T: mem code : 21 / 381081797
T: mem code : 8 / 381036885
T: mem code : 83 / 381036930
T: mem code : 196 / 381045816
T: mem code : 30 / 381049598
T: mem code : 183 / 381052358
T: isAdupl?
T: no pub. dupl
T: Dequeue JSON
CORRUPT HEAP: Bad head at 0x3ffd9cf8. Expected 0xabba1234 got 0x00000020

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)

Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

Then a few minutes later: (note log level reverted to 'Notice')

N: [ OMG->MQTT ] topic: homeassistant-crap/sensor/LaCrosse-TX141THBv2-0-216-temperature_C/config msg: {"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/0/216","dev_cla":"temperature","unit_of_meas":"°C","name":"Temperature","uniq_id":"LaCrosse-TX141THBv2-0-216-temperature_C","val_tpl":"{{ value_json.temperature_C | is_defined }}","stat_cla":"measurement","device":{"ids":["LaCrosse-TX141THBv2-0-216"],"cns":[["mac","LaCrosse-TX141THBv2-0-216"]],"mdl":"LaCrosse-TX141THBv2","name":"LaCrosse-TX141THBv2-0-216","via_device":"OMG_lilygo_rtl_433_ESP-2"}}
CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x0000001c

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

Then again:

N: [ OMG->MQTT ] topic: homeassistant-crap/sensor/LaCrosse-TX141THBv2-0-216-temperature_C/config msg: {"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/0/216","dev_cla":"temperature","unit_of_meas":"°C","name":"Temperature","uniq_id":"LaCrosse-TX141THBv2-0-216-temperature_C","val_tpl":"{{ value_json.temperature_C | is_defined }}","stat_cla":"measurement","device":{"ids":["LaCrosse-TX141THBv2-0-216"],"cns":[["mac","LaCrosse-TX141THBv2-0-216"]],"mdl":"LaCrosse-TX141THBv2","name":"LaCrosse-TX141THBv2-0-216","via_device":"OMG_lilygo_rtl_433_ESP-2"}}
CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x00000024

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

etc.

So somehow there is a heap corruption...

Note that the last reported stack low water mark was 936 bytes before the first crash and then about 3600 bytes before the subsequent crashes.

So, if it is exhausting stack, it must be doing that while trying to decode the current message so it doesn't get reported via the RFtoMQTT routine.

Again, it can run fine for hours or even days (in this case it ran fine for 4 days (!) before I start seeing the heap corruption. But once it occurs, then it seems to recur and reboot typically every few minutes for 5-10 or more times before resuming stability. In this case it rebooted about 6 times over the course of about 2.5 hours.

Any ideas what may be causing the heap corruption and/or how to troubleshoot?

The only thing I can think of is that there is some sensor that reports intermittently (maybe it's only turned on every once in a while) and which consumes excessive stack causing the heap corruption. Interestingly, I can't say for sure, but it seems that the crashes typically occur between 6 and 10AM in the morning. I don't think it's one of my known sensors because all mine seem to be working fine (and have been doing so for quite a while)

Oct 27 '24 22:10 puterboy

Maybe try to increase again the stack, and if this is still the problem we should check if there is a way to limit the decoder consumption and avoid such crashes.

Oct 27 '24 23:10 1technophile

That seems though to be a bit "brute force" Is there any way to log which decoder is being called and how much stack it uses? That way I could see what is causing the problem...

Oct 27 '24 23:10 puterboy

I like your idea of limiting decoder consumption and perhaps logging any time a decoder tries to use more than that amount so that we can know the name of the decoder and how much stack it sought to consume...

Oct 27 '24 23:10 puterboy

Add monitor_filters = esp32_exception_decoder to your configuration in platformio.ini

It should show exactly where the error occurred

Oct 27 '24 23:10 NorthernMan54

Add monitor_filters = esp32_exception_decoder to your configuration in platformio.ini

It should show exactly where the error occurred

I did this btw last time I had crashes and it pointed to increasing the stack size but happy to enable it again to see if I get the same error.

Oct 27 '24 23:10 puterboy

I received another half dozen reboots this morning with monitor_filters = esp32_exception_decoder but the serial log error messages seem to be the same as before.

N: [ OMG->MQTT ] topic: home-crap/OMG_lilygo_rtl_433_ESP-2/RTL_433toMQTT/Ambientweather-F007TH/2/178 msg: {"model":"Ambientweather-F007TH","id":178,"channel":2,"battery_ok":1,"temperature_C":16.61111,"humidity":41,"mic":"CRC","protocol":"Ambient Weather F007TH, TFA 30.3208.02, SwitchDocLabs F016TH temperature sensor","rssi":-53,"duration":189001}
CORRUPT HEAP: Bad head at 0x3ffba2f8. Expected 0xabba1234 got 0xffffffff

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810




ELF file SHA256: cc2ead56a1306f41

AND

CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x00000024

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810




ELF file SHA256: cc2ead56a1306f41

etc.

So maybe it's not a decoder issue???? Or at least not like the ones causing https://github.com/1technophile/OpenMQTTGateway/issues/2043 where I got an error saying Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder) which was fixed by increasing stack size.

Any ideas on how to debug this further? (of course I could just try increasing stack size further to see if that helps but I would like to get to the root of the issue)

Oct 28 '24 17:10 puterboy

Oh, If you run a build and upload it to the board, then run the monitor it will convert the backtrace addresses to actual lines of code, so we can pin point the issue.

Oct 29 '24 02:10 NorthernMan54

I did re-build and re-uploaded (using platformio) with monitor_filters = esp32_exception_decoder

Oct 29 '24 02:10 puterboy

That's weird, as it should have given a longer back trace. If you build a different project afterwards, it does break the feature.

Oct 29 '24 02:10 NorthernMan54

I will try to rebuild again...

Oct 29 '24 03:10 puterboy

I am not seeing anything in the backtrace other than what I reported above...

Meanwhile it ran fine for 2 weeks without any crash or reboot. Then today for about 90 minutes it rebooted every minute before stabilizing. I'm pretty sure it is being triggered by a wireless sensor that either appears periodically or periodically has an output state that causes the crash.

My evidence is as follows:

My other Lilygo RTL433 was not triggering when on the 3rd floor but then also started triggering every minute when I brought it down to the first floor where my main Lilygo RTL433 is located -- so it seems to be location dependent (and not an issue with the device itself)
Removing the antenna caused the reboots to stop -- so it seems to be triggered by receiving a sensor signal rather than independently by the device or code itself.

I just increased the stack size to 15000 to see if that solves the problem -- but unfortunately it's not crashing now so I have to wait until the bad sensor or sensor message comes online again...

Stay tuned...

Nov 22 '24 05:11 puterboy

Hi PuterBoy. What is the current status of this issue? I have similar problem, and looking for solution to it...

Jan 24 '25 07:01 Vasilich

This issue is stale because it has been open for 90 days with no activity.

Apr 25 '25 00:04 github-actions[bot]

It seems to be more stable since I increased the cache size though it still reboots occasionally (say once ever 5-10 days) I have not though had any repetitive reboots. But since I don't know the real root cause which was quite intermittent, it's hard to say whether this is a full fix.

Apr 25 '25 05:04 puterboy

This issue is stale because it has been open for 90 days with no activity.

Jul 25 '25 00:07 github-actions[bot]