DRAFT: Add hardware watchdog for the RP2040 platform
The watchdog must be enabled late in the boot stage because the RP2040's watchdog only allows a delay of about 8 seconds before resetting, and some of the initialization calls made after rp2040Setup(), and even after the first call to rp2040Loop(), block execution long enough to exceed this delay.
Due to this architectural problem, the RP2040 can still freeze without recovery if it hangs during boot and initialization. Practical experience seems to indicate that in most cases a freeze happens long into the running state (often after days), so unless one builds a node to be deployed deep in the wilderness, we accept this risk.
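To illustrate the late-arming idea described above, here is a minimal sketch. It is not the code from this PR: `FakeWatchdog` is a stand-in for the Pico SDK calls `watchdog_enable()` and `watchdog_update()` so the example compiles off-target, and `LateArmedWatchdog`, `onLoopPass()` and the number of skipped passes are hypothetical names and values chosen for illustration.

```cpp
#include <cstdint>

// Stand-in for the hardware watchdog so the sketch builds on a PC.
// On the real target these would map to watchdog_enable()/watchdog_update().
struct FakeWatchdog {
    bool enabled = false;
    uint32_t timeoutMs = 0;
    uint32_t feeds = 0;

    void enable(uint32_t ms) { enabled = true; timeoutMs = ms; }
    void update() { if (enabled) ++feeds; }
};

// Hypothetical helper: leaves the watchdog disarmed for the first few loop
// passes (which may block for longer than the hardware timeout), then arms
// it and feeds it on every later pass.
struct LateArmedWatchdog {
    FakeWatchdog &hw;
    uint32_t passesBeforeArming;
    uint32_t passes;

    LateArmedWatchdog(FakeWatchdog &watchdog, uint32_t skipPasses)
        : hw(watchdog), passesBeforeArming(skipPasses), passes(0) {}

    // Call once at the end of every main loop pass.
    void onLoopPass() {
        ++passes;
        if (!hw.enabled) {
            // Arm only once the slow initialization passes are behind us.
            if (passes > passesBeforeArming) hw.enable(8000);
            return;
        }
        hw.update();
    }
};
```

With `skipPasses` set to 1, the first pass does nothing, the second pass arms the watchdog, and every pass after that feeds it. Anything that hangs before arming is not covered, which is the accepted risk described above.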
There is no likely workaround for the short maximum delay of the hardware watchdog, since the limit comes from the hardware registers.
The downside of adding a watchdog is that it hides (potentially serious) bugs. I'm hesitant about it, especially since the maximum timeout you can set is relatively short and it requires workarounds like this to make it work. For nRF52 we don't have a watchdog, and for ESP32 the timeout was increased from 45 to 90 seconds.
Have you confirmed the watchdog timer is properly reset when the device receives a lot of packets back-to-back, reconnects to a client app when you have a lot of nodes in the DB, or reconnects to your Wi-Fi AP (this delay might be problematic), etc.?
Sorry for the late answer, vacation time got me :-D
I agree that 8 seconds is too little time. My test devices had uptimes between 3 and 14 hours, so clearly a problem.
I have reworked the solution to provide the same 90-second timeout as for the ESP32 device. I will mark this PR as draft until my devices have run for some days (approx. 32 hours and counting into testing with my main node).
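For readers wondering how a 90-second timeout could sit on top of a watchdog that only allows about 8 seconds, one common pattern is sketched below. This is an assumption about the general technique, not the code in this PR: a periodic timer feeds the hardware watchdog only while the main loop has checked in within the last 90 seconds. `ExtendedWatchdog`, `heartbeat()` and `tick()` are hypothetical names, time is passed in as a parameter so the model runs off-target, and on real hardware the feed would be a call to `watchdog_update()`.

```cpp
#include <cstdint>

// Hypothetical model of a software-extended watchdog.
struct ExtendedWatchdog {
    static constexpr uint32_t kHardwareTimeoutMs = 8000;  // assumed hardware limit
    static constexpr uint32_t kSoftwareTimeoutMs = 90000; // desired effective timeout

    uint32_t lastHeartbeatMs = 0; // last time the main loop checked in
    uint32_t lastFeedMs = 0;      // last time the hardware watchdog was fed

    // Called from the main loop on every pass.
    void heartbeat(uint32_t nowMs) { lastHeartbeatMs = nowMs; }

    // Called from a periodic timer, more often than the hardware timeout.
    // Returns true if the hardware watchdog was fed on this tick.
    bool tick(uint32_t nowMs) {
        if (nowMs - lastHeartbeatMs >= kSoftwareTimeoutMs) {
            return false; // main loop is stale: stop feeding, let the hardware reset
        }
        lastFeedMs = nowMs; // on the real target: watchdog_update()
        return true;
    }

    // Models whether the hardware watchdog would have fired by nowMs.
    bool hardwareExpired(uint32_t nowMs) const {
        return nowMs - lastFeedMs >= kHardwareTimeoutMs;
    }
};
```

In this model a hung main loop leads to a reset roughly 90 seconds after its last heartbeat, plus up to one hardware period. The trade-off is that the feed no longer proves the main loop is alive on its own; only the heartbeat check does, so the timer callback has to stay very simple.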
> Sorry for the late answer, vacation time got me :-D
No problem, we're all doing this for fun :)
> my test devices had uptimes between 3 and 14 hours, so clearly a problem.
If you have such short uptimes, it would be relatively easy to get to the root cause of the problem. You can e.g. connect another Pico as a debugger. The call stack when it hangs will likely give us a clue about what's going wrong - it might even be something in the Arduino Pico core, like we had in e.g. https://github.com/meshtastic/firmware/pull/2558.
If it's now going to reboot every 3 hours, that's not a good solution either IMO, as you'll lose everything in RAM (e.g. the queue of received packets to be delivered to a phone), and every time it boots it sends out its NodeInfo to everyone and asks for a response, which leads to a storm of packets if it's in a good position.
Just to make it clear :-D the 3-hour reboots were after having only the 8-second timeout, so no need to debug into that ;)
I am field-testing the current code now to check what uptimes we get with the current revision. I also need to run some tests to find a baseline. All of this takes some time, so I'll just leave it as a draft until I am a bit wiser on the general stability.
> Just to make it clear :-D the 3-hour reboots were after having only the 8-second timeout, so no need to debug into that ;)
Ah, I see. Likely it wasn't returning to rp2040Loop() often enough.
But still, in your current revision there's only an 8-second timeout. It's fixed in the hardware; I believe there's no way around it.
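As a rough check of where the roughly 8-second limit comes from, the arithmetic can be sketched as below. The inputs are assumptions taken from my reading of the RP2040 datasheet and the Pico SDK rather than from this PR: a 24-bit load register counted in 1 µs ticks, and erratum RP2040-E1, under which the counter decrements twice per tick. `feedsNeeded()` is a hypothetical helper added only for illustration.

```cpp
#include <cstdint>

// Assumption: the watchdog load value is 24 bits wide, in 1 µs ticks.
constexpr uint32_t kLoadRegisterMax = 0xFFFFFF;

// Assumption: erratum RP2040-E1, the counter decrements twice per tick,
// which halves the usable range.
constexpr uint32_t kDecrementsPerTick = 2;

constexpr uint32_t kMaxTimeoutUs = kLoadRegisterMax / kDecrementsPerTick; // 8388607 µs
constexpr uint32_t kMaxTimeoutMs = kMaxTimeoutUs / 1000;                  // 8388 ms

// Hypothetical helper: how many maximum-length hardware periods are needed
// to cover a desired software timeout (rounded up).
constexpr uint32_t feedsNeeded(uint32_t desiredMs) {
    return (desiredMs + kMaxTimeoutMs - 1) / kMaxTimeoutMs;
}
```

Under these assumptions the hardware tops out at about 8.3 seconds, so a 90-second timeout cannot be configured directly and would need at least 11 hardware periods chained together in software.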
Hi @The3rdPlace, just checking in to see how your tests went ...
Closing this to clean our pull request queue since it's been waiting on progress for a while. Feel free to re-propose any time!