meta-balena
Improve Engine healthchecks
Two commits here, each with a different improvement to Engine healthchecks, plus one adding automated tests for Engine healthchecks. Details follow.
Manual tests performed:

- Ran for 20+ hours on a Pi Zero with 6 containers, each running `stress --cpu 25 --vm 15 --vm-bytes 1000000 --io 10 --hdd 10 --hdd-bytes 4096`, without having the watchdog kick in to kill the Engine.
- Timed the execution of the new healthcheck on a Pi Zero with the same load as above. Of 150 executions, only 4 took 8+ seconds (with a maximum of 15.2s); all the others were in the 5s to 7.99s range.
- Sent `SIGSTOP` to `balenad` to confirm that the watchdog would eventually kick in.
- Same for `balena-engine-containerd`.
- Resolves https://github.com/balena-os/meta-balena/issues/1676
- Resolves https://github.com/balena-os/meta-balena/issues/2423
- Resolves https://github.com/balena-os/meta-balena/issues/2326
- Resolves https://github.com/balena-os/meta-balena/issues/2327
- Resolves https://github.com/balena-os/balena-engine/issues/196
Automated healthchecks added:

- Send a `SIGSTOP` to `containerd` to force the healthcheck to kick in, then check that the Engine eventually recovers.
- Time the execution of the healthcheck script to detect performance regressions. This is dependent on device performance, so it's less than ideal: it was tuned to not give false positives with the new healthcheck on a Pi Zero, and should still be able to catch substantial performance regressions on a Pi 3, but it is probably prone to false negatives on fast devices.
Use a lightweight Engine healthcheck
Previously, our healthcheck verified that we were able to start a new
container. This had two downsides:
1. It was relatively heavyweight. On devices under heavy load, it would
   sometimes take so long to run that the Engine was killed by the
   watchdog.
2. It wrote to the storage media. Creating a container involves writing
   some data to persistent storage, so the healthcheck itself was
   wearing out the storage media.
This new healthcheck simply pings both `balenad` and `containerd`, which
is much faster than starting a new container and doesn't write to disk.
The step of pinging `containerd` is important because we have seen at
least one case in the past in which `balenad` was working but
`containerd` wasn't.
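As a rough illustration (this is not the actual meta-balena script; the command lines passed in are placeholders chosen by the caller), the ping-both-daemons logic amounts to:

```shell
#!/bin/sh
# Sketch of a ping-style healthcheck: fail if either daemon does not
# respond. Command names and socket paths on a real device may differ.

# check_daemon CMD [ARGS...]: run a cheap "ping" command, discard its
# output, and propagate its exit status.
check_daemon() {
    "$@" > /dev/null 2>&1
}

# engine_healthcheck "BALENAD_PING" "CONTAINERD_PING": healthy (exit 0)
# only if both daemons respond. Each argument is a whole command line,
# deliberately left unquoted below so it word-splits into a command.
engine_healthcheck() {
    check_daemon $1 || return 1   # ping balenad
    check_daemon $2 || return 1   # balenad up does not imply containerd up
    return 0
}
```

On a device this might be invoked as, e.g., `engine_healthcheck "balena-engine version" "balena-engine-ctr version"` (command names assumed for illustration).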
Make Engine watchdog termination graceful
With `WatchdogSignal=SIGTERM`, systemd will send a SIGTERM and give the
Engine 90 seconds to shut down gracefully before sending a SIGKILL. We
had cases of the Engine's on-disk metadata getting corrupted after the
watchdog sent it a SIGKILL directly. This change should mitigate that
issue.
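For reference, the relevant unit settings could look like the drop-in below. The `WatchdogSec` value is an assumption for illustration; `WatchdogSignal=SIGTERM` and the 90-second stop timeout match the behavior described above.

```ini
[Service]
# systemd kills the service if no watchdog keep-alive (WATCHDOG=1)
# arrives within this window (interval value assumed here).
WatchdogSec=180
# On watchdog timeout, send SIGTERM instead of systemd's default
# SIGABRT, giving the Engine a chance to shut down cleanly...
WatchdogSignal=SIGTERM
# ...and only escalate to SIGKILL after this grace period (90 s is
# also systemd's default DefaultTimeoutStopSec).
TimeoutStopSec=90
```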
Contributor checklist

- [x] Changes have been tested
- [x] Covered in automated test suite
- [ ] Manual test case recorded
- [x] `Change-type` present on at least one commit
- [x] `Signed-off-by` is present
- [x] The PR complies with the Open Embedded Commit Patch Message Guidelines
Reviewer Guidelines
- When submitting a review, please pick:
- 'Approve' if this change would be acceptable in the codebase (even if there are minor or cosmetic tweaks that could be improved).
- 'Request Changes' if this change would not be acceptable in our codebase (e.g. bugs, changes that will make development harder in future, security/performance issues, etc).
- 'Comment' if you don't feel you have enough information to decide either way (e.g. if you have major questions, or you don't understand the context of the change sufficiently to fully review yourself, but want to make a comment)
It looks like we could also remove the balena-healthcheck-image related bits from
`meta-balena-common/recipes-containers/balena/balena/balena-healthcheck`,
`meta-balena-common/recipes-containers/docker-disk/files/entry.sh`, and
`meta-balena-common/recipes-containers/docker-disk/docker-disk.bb`.
@jakogut I think I did this already on this PR. Maybe I missed some bits?
@lmbarros Disregard, I got my wires crossed. Looks good to me!
hi @lmbarros there's some cleanup left
Thanks @alexgg, I updated the PR.
also, did you find out what it takes to starve the engine with this new healthcheck?
I didn't explore this in depth, but here are two data points:
- With the parameters I used in the tests above (i.e., 6 containers, each running `stress --cpu 25 --vm 15 --vm-bytes 1000000 --io 10 --hdd 10 --hdd-bytes 4096`), the Engine itself was running fine, but the healthchecks for the Supervisor (very often) and chrony (more rarely) were failing.
- I tried tripling the number of `stress` processes (still on 6 containers). In this case, the Engine was killed after ~45 minutes. The average running time of the healthchecks was still under 1 minute, but some outliers would take much longer.
@lmbarros Is this still a draft?
@jakogut, I am happy enough with the PR since the last update. Making it "ready for review".
Looks good to me. Any additional comments, @alexgg? Your approval will be needed to merge this, as you requested changes.
@balena-ci rebase