Boost daemons' CPU scheduling priority
In order to keep balenaEngine responsive under high system load, this commit increases the scheduling priority of `balenad` and `balena-engine-containerd`.
Both are assigned real-time scheduling policies, and in the case of `balena-engine-containerd` we also set the `SCHED_RESET_ON_FORK` flag so that user containers don't inherit the increased priority.
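To make the mechanism concrete, here is a minimal Go sketch (an assumed illustration, not the PR's actual implementation) of putting a thread on `SCHED_RR` via the raw `SYS_SCHED_SETSCHEDULER` syscall, with the `SCHED_RESET_ON_FORK` bit OR-ed into the policy; the constant values are the ones from `<linux/sched.h>`:

```go
package sched

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// schedParam mirrors the kernel's struct sched_param.
type schedParam struct {
	priority int32
}

// Policy constants from <linux/sched.h>: SCHED_RR is the round-robin
// real-time policy; SCHED_RESET_ON_FORK makes forked children revert
// to the default policy instead of inheriting the real-time one.
const (
	schedRR          = 2
	schedResetOnFork = 0x40000000
)

// setRealtime puts the thread tid (0 means the calling thread) on the
// SCHED_RR policy at the given real-time priority (1..99).
func setRealtime(tid int, prio int32, resetOnFork bool) error {
	policy := uintptr(schedRR)
	if resetOnFork {
		policy |= schedResetOnFork
	}
	param := schedParam{priority: prio}
	_, _, errno := unix.Syscall(
		unix.SYS_SCHED_SETSCHEDULER,
		uintptr(tid),
		policy,
		uintptr(unsafe.Pointer(&param)),
	)
	if errno != 0 {
		return errno
	}
	return nil
}
```

With the reset-on-fork bit set, processes forked from such a thread start back on the default `SCHED_OTHER` policy, which is what keeps user containers from inheriting the boost.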
I did manual testing, running this on a Pi Zero with a large number of stress processes generating CPU, memory, I/O and disk load. This generally increased the time the Engine survives before being killed by the watchdog (but see the caveats list below).
Some possible caveats:
- While this generally improved the Engine's survivability (even allowing it to run for multiple hours in some cases), it's far from perfect. It's not rare for the watchdog to kill the Engine after a couple dozen minutes, and things like updating the user containers drastically increase the chances of triggering the watchdog.
- This PR deals only with CPU. We need to do something similar for I/O.
- What I did: Change the scheduling policy of the balenaEngine daemons to Round Robin (one of the Linux real-time scheduling policies).
- How I did it: By calling the `SYS_SCHED_SETSCHEDULER` syscall on each of the daemons' threads (see the sketch after this list).
- How to verify it: Manual testing, running this version of the Engine on a Pi Zero with a stress-testing container and comparing with a vanilla Engine.
- Description for the changelog: Boost daemons' CPU scheduling priority
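Since a single `sched_setscheduler` call affects only one thread, and a Go daemon runs on several OS threads created by the runtime, the policy has to be applied per thread. A hypothetical helper along those lines (not the PR's actual code, building on the `setRealtime` sketch above) could walk `/proc/self/task`:

```go
import (
	"fmt"
	"os"
	"strconv"
)

// applyToAllThreads applies setRealtime to every OS thread of the
// current process by walking /proc/self/task. Hypothetical helper,
// shown only to illustrate the per-thread approach described above.
func applyToAllThreads(prio int32, resetOnFork bool) error {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return err
	}
	for _, e := range entries {
		tid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // skip anything that isn't a numeric thread id
		}
		if err := setRealtime(tid, prio, resetOnFork); err != nil {
			return fmt.Errorf("setting policy for tid %d: %w", tid, err)
		}
	}
	return nil
}
```

Note that threads the runtime creates after this call would still need the policy applied. For verification, `chrt -p <pid>` (from util-linux) prints a process's current scheduling policy and priority, which makes it easy to confirm the daemons ended up on `SCHED_RR` while user containers stayed on `SCHED_OTHER`.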
Is there any downside to this PR besides not addressing all the issues? Is there any aspect of this change that would make it potentially worse than what we have in the field today?
All improvements are good at this point, even if we know there are other changes we need.
@klutchell, I was just thinking of a possible "regression": during a user container update, I think the Engine may monopolize a core while it decompresses the new image as it downloads. On a slow, single-core device (such as the Pi Zero), the user container may now struggle to do any real work until the update finishes. Previously it would be able to run with the same priority as the Engine itself. (Of course, this is all very tricky to balance... we are doing all this to avoid the opposite, that is, to have user containers CPU-starving the Engine.)
Side note: I am running a new batch of tests with this version. I often see no significant improvement, but sometimes it runs for hours. I still want to better understand what happens differently between these extremes.
Closing as obsolete. This was motivated by problems caused by our health checks taking too long to execute, and we solved that problem by switching to lighter-weight Engine health checks.