balena-allwinner
Watchdog triggered during high load
We have observed that a watchdog-triggered device reset can occur when devices are under heavy load. Under heavy disk usage in particular, the default watchdog ping timeout can easily be overrun, resulting in unexpected device restarts.
There will likely be multiple steps required in order to fully rectify this issue. Some possibilities include:
- upgrading systemd to v237+, which includes main loop improvements that fix some observed bugs with this effect. This may be achieved by upgrading the poky build dependency?
- increasing the watchdog timeout. The default on most desktop operating systems is measured in minutes; 10 seconds might simply be too aggressive for this feature.
- possibly patching systemd to be more aggressive in its watchdog pings (tested in-house with minimal effect)
- allowing watchdogd to assume the responsibility of pinging the hardware watchdog instead of systemd. This could help if systemd is blocking on disk I/O or other blocking calls during its main loop.
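As a rough illustration of the "increase the watchdog timeout" option, systemd takes the timeout it programs into the hardware watchdog from `RuntimeWatchdogSec=` in the `[Manager]` section of its system configuration. The sketch below writes that setting as a drop-in file. The helper name `write_watchdog_dropin`, the file name `10-watchdog.conf` and the value of 16 seconds are illustrative assumptions, not settings taken from this repository, and a read-only root filesystem would need the change made at build time instead.

```shell
# Hypothetical helper: write a systemd manager drop-in that sets the
# hardware watchdog timeout. Arguments: target directory, timeout in seconds.
write_watchdog_dropin() {
    dir="$1"
    secs="$2"
    mkdir -p "$dir" || return 1
    # RuntimeWatchdogSec= belongs in the [Manager] section.
    printf '[Manager]\nRuntimeWatchdogSec=%s\n' "$secs" > "$dir/10-watchdog.conf"
}

# Example (as root on the device; the value 16 is an assumption):
#   write_watchdog_dropin /etc/systemd/system.conf.d 16
```

The new value should take effect after a reboot. Whether the hardware accepts the requested timeout depends on the watchdog driver.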
If it is possible to replicate this on other devices it may be more appropriate to raise this on meta-resin instead. I'll leave that to you to decide.
Here is a test which you can replicate easily. It can also be used to replicate the issue on Armbian if the systemd watchdog timeout is set to 10s as per the configuration here.
First, use up 350 MB of memory:
mkdir /tmp/test
mount -t tmpfs none /tmp/test -o size=350m
dd if=/dev/zero of=/tmp/test/zero.txt bs=1024k count=350
Then start up a large number of sleep processes (copy and paste into terminal):
sleep 100 &
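If pasting the same line many times is inconvenient, a loop along the following lines should have the same effect. This is only a sketch: the function name `spawn_sleepers` is hypothetical, and the process count in the usage comment is an assumption, since the report does not say how many processes are needed to trigger the reset.

```shell
# Hypothetical helper: start COUNT background sleep processes, each
# sleeping for DURATION seconds. The PIDs are collected in SLEEPER_PIDS.
spawn_sleepers() {
    count="$1"
    duration="$2"
    SLEEPER_PIDS=""
    i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$duration" &
        SLEEPER_PIDS="$SLEEPER_PIDS $!"
        i=$((i + 1))
    done
}

# Example (the count of 5000 is an assumption, tune it for the device):
#   spawn_sleepers 5000 100
```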
This is quite a harsh test and not reflective of real-world loads. However, this is an issue we have seen occur with real software (suspected to happen during the application of delta updates).
It's worth noting that the Allwinner H3 has a maximum watchdog timeout value of 16s. Main loop iterations in current systemd can take far longer than this.
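To check which timeout is actually in effect on a device, the kernel exposes a `timeout` attribute for each watchdog device in sysfs, provided it was built with `CONFIG_WATCHDOG_SYSFS`. A minimal sketch follows, assuming the hardware watchdog is registered as `watchdog0`; the helper name `read_watchdog_timeout` is hypothetical.

```shell
# Hypothetical helper: print the configured timeout (in seconds) of a
# watchdog device. Argument: sysfs directory of the device, defaulting
# to watchdog0 (an assumption about how the device is named).
read_watchdog_timeout() {
    dir="${1:-/sys/class/watchdog/watchdog0}"
    if [ -r "$dir/timeout" ]; then
        cat "$dir/timeout"
    else
        echo "no readable timeout attribute under $dir" >&2
        return 1
    fi
}

# Example:
#   read_watchdog_timeout
```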