
Improve Engine healthchecks

Open · lmbarros opened this pull request on Jul 19 '22 · 5 comments

Two commits here, each with a different improvement to Engine healthchecks, plus one adding automated tests for Engine healthchecks. Details follow.

Manual tests performed:

  • Ran for 20+ hours on a Pi Zero with 6 containers, each running stress --cpu 25 --vm 15 --vm-bytes 1000000 --io 10 --hdd 10 --hdd-bytes 4096 (a sketch of this load follows the list below), without the watchdog kicking in to kill the Engine.

  • Timed the execution of the new healthcheck on a Pi Zero with the same load as above. Of 150 executions, only 4 took 8 seconds or more (with a maximum of 15.2 s); all the others were in the 5 s to 7.99 s range.

  • Sent a SIGSTOP to balenad to confirm that the watchdog would eventually kick in.

  • Same for balena-engine-containerd.

  • Resolves https://github.com/balena-os/meta-balena/issues/1676

  • Resolves https://github.com/balena-os/meta-balena/issues/2423

  • Resolves https://github.com/balena-os/meta-balena/issues/2326

  • Resolves https://github.com/balena-os/meta-balena/issues/2327

  • Resolves https://github.com/balena-os/balena-engine/issues/196
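
For reference, a rough sketch of how the stress load from the manual tests above could be reproduced. The image and container names here are assumptions made for illustration; only the stress arguments come from the actual tests:

```sh
# Launch 6 containers, each running the stress load from the manual tests.
# The image name (alpine-stress) and container names are assumptions.
for i in $(seq 1 6); do
    balena-engine run -d --name "stress-$i" alpine-stress \
        stress --cpu 25 --vm 15 --vm-bytes 1000000 --io 10 --hdd 10 --hdd-bytes 4096
done
```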

Automated tests added:

  • Send a SIGSTOP to containerd to force the healthcheck to kick in. Check if the Engine eventually recovers.
  • Time the execution of the healthcheck script to detect performance regressions (see the sketch below). This is dependent on device performance, so it's less than ideal: it was tuned not to give false positives with the new healthcheck on a Pi Zero and should still catch substantial performance regressions on a Pi 3, but it is probably prone to false negatives on fast devices.
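
A minimal sketch of the timing check, assuming the healthcheck is exposed as a standalone script and that a generous per-device threshold is acceptable; the path and threshold value below are assumptions, not the exact values used in the test suite:

```sh
# Run one healthcheck and fail if it errors or exceeds the time threshold.
# Both the script path and the threshold are assumptions for illustration.
HEALTHCHECK=/usr/lib/balena/balena-healthcheck
THRESHOLD=30   # seconds, tuned per device class to avoid false positives

start=$(date +%s)
"$HEALTHCHECK"
status=$?
elapsed=$(( $(date +%s) - start ))

if [ "$status" -ne 0 ] || [ "$elapsed" -gt "$THRESHOLD" ]; then
    echo "healthcheck failed or took ${elapsed}s (limit: ${THRESHOLD}s)" >&2
    exit 1
fi
```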

Use a lightweight Engine healthcheck

Previously, our healthcheck verified whether we were able to start a new
container. This had two downsides:

1. It was relatively heavyweight. On devices under heavy load, it would
   sometimes take so long to run that the Engine was killed by the
   watchdog.
2. It wrote to the storage media. Creating a container involves writing
   some data to persistent storage, so the healthcheck was contributing
   to wear on the storage media.

This new healthcheck simply pings both `balenad` and `containerd`, which
is much faster than starting a new container and doesn't write to disk.
The step of pinging `containerd` is important because we have seen at
least one case in the past in which `balenad` was working but
`containerd` wasn't.
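
A minimal sketch of what such a ping-based check could look like. The socket paths, binary names, and use of curl below are assumptions for illustration, not necessarily the exact script shipped by this PR:

```sh
#!/bin/sh
set -e

# Ping balenad through its API socket; /_ping is part of the Engine API.
# The socket path is an assumption.
curl --silent --fail --max-time 10 \
    --unix-socket /var/run/balena-engine.sock http://localhost/_ping > /dev/null

# Ping containerd by asking it for its version; the binary name and
# socket path are assumptions.
balena-engine-containerd-ctr \
    --address /run/balena-engine/containerd/balena-engine-containerd.sock \
    version > /dev/null
```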

Make Engine watchdog termination graceful

With `WatchdogSignal=SIGTERM`, systemd will send a SIGTERM and give the
Engine 90 seconds to shut down gracefully before sending a SIGKILL. We
had cases of the Engine's on-disk metadata getting corrupted after the
watchdog sent it a SIGKILL directly; this change should minimize that
issue.
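
For illustration, the setting could be applied with a systemd drop-in like the one below; the unit name and drop-in path are assumptions, and in meta-balena the change lives in the Engine's service file itself:

```sh
# Hypothetical drop-in for the Engine service (unit name and path assumed).
mkdir -p /etc/systemd/system/balena.service.d
cat > /etc/systemd/system/balena.service.d/watchdog.conf <<'EOF'
[Service]
# Send SIGTERM on watchdog timeout instead of the default SIGABRT, giving
# the Engine a chance to shut down cleanly before systemd falls back to SIGKILL.
WatchdogSignal=SIGTERM
EOF
systemctl daemon-reload
```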

Contributor checklist

  • [x] Changes have been tested
    • [x] Covered in automated test suite
    • [ ] Manual test case recorded
  • [x] Change-type present on at least one commit
  • [x] Signed-off-by is present
  • [x] The PR complies with the OpenEmbedded Commit Patch Message Guidelines

Reviewer Guidelines

  • When submitting a review, please pick:
    • 'Approve' if this change would be acceptable in the codebase (even if there are minor or cosmetic tweaks that could be improved).
    • 'Request Changes' if this change would not be acceptable in our codebase (e.g. bugs, changes that will make development harder in future, security/performance issues, etc).
    • 'Comment' if you don't feel you have enough information to decide either way (e.g. if you have major questions, or you don't understand the context of the change sufficiently to fully review yourself, but want to make a comment)

lmbarros · Jul 19 '22 14:07

It looks like we could also remove the balena-healthcheck-image related bits from meta-balena-common/recipes-containers/balena/balena/balena-healthcheck, meta-balena-common/recipes-containers/docker-disk/files/entry.sh, and meta-balena-common/recipes-containers/docker-disk/docker-disk.bb.

@jakogut I think I did this already on this PR. Maybe I missed some bits?

lmbarros · Jul 20 '22 20:07

@lmbarros Disregard, I got my wires crossed. Looks good to me!

jakogut · Jul 20 '22 20:07

hi @lmbarros there's some cleanup left

Thanks @alexgg, I updated the PR.

also, did you find out what it takes to starve the engine with this new healthcheck?

I didn't explore this in depth, but here are two data points:

  • With the parameters I used in the tests above (i.e., 6 containers, each running stress --cpu 25 --vm 15 --vm-bytes 1000000 --io 10 --hdd 10 --hdd-bytes 4096), the Engine itself was running fine, but the healthchecks for the Supervisor (very often) and chrony (more rarely) were failing.
  • I tried tripling the number of stress processes (still on 6 containers). In this case, the Engine was killed after ~45 minutes. The average running time of the healthchecks was still under 1 minute, but some outliers would take much longer.

lmbarros · Jul 21 '22 16:07

@lmbarros Is this still a draft?

jakogut · Aug 10 '22 18:08

@jakogut, I am happy enough with the PR since the last update. Making it "ready for review".

lmbarros · Aug 10 '22 18:08

Looks good to me. Any additional comments, @alexgg? Your approval will be needed to merge this, as you requested changes.

jakogut · Aug 16 '22 19:08

@balena-ci rebase

alexgg · Aug 17 '22 13:08