bosh-linux-stemcell-builder
Noble Numbat stemcell no longer uses monit
- Windows stemcell has precedent for not using monit
- There's a systemd shim that we might be able to use for releases that use bpm
- early reports are that systemd is noticeably faster than monit
- It'll be a breaking change
- We may create a "convenience" bash script for operators so that "monit summary" and "monit restart ..." work (TBD)
- We may force release authors who don't use bpm to cut new releases that use it
- bpm already accommodates releases that aren't meant to be run in a container (still runs in a namespace, but places almost no restrictions on that namespace), e.g. kubelet (kubo-release).
- release authors need a way to support jammy and noble at the same time.
- monit is super old, difficult to compile, and there may be CVEs we're not aware of because it's not installed via a distro package.
- we can get rid of the cgroup / iptables stuff we created to harden monit
- See here for an in-depth analysis of why we should replace monit and how to best accomplish that.
👍 Another vote here to get rid of monit... hopefully in favor of something better. Below I want to describe issues the App Runtime Platform WG has had with monit.
The monit version is old: it is so old that it does not support HTTP health checks.
We have written our own healthcheckers:
- We wrote a generic healthchecker release that we vendor to run side by side with some of our components
- We configure it for gorouter, bosh-dns-adapter, and silk-daemon.
- We write a restart script that decides how to handle the output from the healthchecker. For example, for gorouter we restart it the first 10 times, but after that we leave it up because repeatedly restarting gorouter is bad for HA.
✨ It would be great if we didn't have to maintain our own tools to do this healthchecking!
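For context, a sketch of what this side-by-side healthchecking looks like today when colocating a vendored healthchecker next to gorouter; the property names below are illustrative only, not the actual spec of the healthchecker release:

```yaml
# Illustrative manifest snippet (property names are hypothetical, not the real
# healthchecker release spec): a vendored healthchecker job runs side by side
# with gorouter because monit cannot do HTTP health checks itself.
instance_groups:
- name: router
  jobs:
  - name: gorouter
    release: routing
  - name: healthchecker
    release: routing
    properties:
      healthchecker:
        # Poll gorouter's local health endpoint over HTTP.
        endpoint: http://127.0.0.1:8080/health
        # A restart script decides what to do with failures, e.g. restart
        # gorouter at most 10 times, then leave it up to avoid hurting HA.
        restart_limit: 10
```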
Has moving to a different service manager like systemd been discussed? Reading the document, it sounds like the proposed way forward is bpm only.
We have use-cases which require processes running directly in the host namespaces. In our setup with haproxy-boshrelease we have side-cars and end up mounting / sharing most of the things which are isolated by bpm to make them accessible again, defeating the purpose of isolation. There are other features which bpm currently prevents us from using, such as CPU pinning (there is only a single cgroup with all CPUs and no option to change that at the moment).
Personally, I'm struggling to see the need for isolating those processes in containers since we spawn dedicated VMs for deployments / jobs (this is not to say that there are no valid use-cases to do so).
In a bit more detail:
> release authors need a way to support jammy and noble at the same time.

systemd would be available on both.
> We may force release authors who don't use bpm to cut new releases that use it

We have use-cases which require processes to run directly on the host, which is not possible with BPM as I've understood it?
> bpm already accommodates releases that aren't meant to be run in a container (still runs in a namespace, but places almost no restrictions on that namespace), e.g. kubelet (kubo-release).

I've taken a look at the docs and don't think that this covers our use-case (but haven't tested it due to time constraints, sorry).
> It would be great if we didn't have to maintain our own tools to do this healthchecking!

HTTP health checks are not something offered by systemd; the "systemd way" would be sd_notify, which we could implement against if we decide to go with systemd.
To be clear: I'm not saying systemd is the only way, but I'm afraid bpm in its current form is not able to address all use-cases adequately, and if we are making a breaking change anyway we might want to look for well-established tools instead of creating our own.
Yeah, I'm not in favor of creating a process monitor. That's a "solved" problem and considering how painful most of the solutions are, it's not an easy problem to solve.
One thing that BPM does bring is a generic yaml structure for describing how to run a bosh job. I think that would be a good starting point for a monit replacement; since a lot of bosh jobs already have a bpm.yml file, it would be easy for them to migrate.
The agent could parse that file and then create a systemd (or other) config to run the job. There could still be a `bpm: true` flag to enable containerization and use bpm to isolate the job.
What we don't want to do is "monit 2.0" where we have release authors create systemd service files that we then take and give to systemd directly. We should have a translation layer so we have flexibility to change the process runner/monitor at a later date.
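To make that concrete, here is a minimal sketch of the kind of description the agent could parse and translate into a systemd unit. The process fields follow the documented bpm.yml shape; the top-level `bpm: true` flag is the hypothetical knob proposed above, and where it would actually live is an open question:

```yaml
# Sketch of a bpm.yml-style process description that the agent could own the
# translation for (into a systemd unit: ExecStart, Restart=, limits, ...),
# instead of release authors shipping systemd unit files directly.
processes:
- name: web-server
  executable: /var/vcap/packages/web-server/bin/server
  args:
  - --config
  - /var/vcap/jobs/web-server/config/server.yml
  env:
    LOG_LEVEL: info
  limits:
    memory: 1G
    open_files: 8192
# Hypothetical flag from the comment above: opt in to bpm containerization;
# when false or absent, the process would run directly under the process manager.
bpm: true
```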
> I've taken a look at the docs and don't think that this covers our use-case (but haven't tested it due to time constraints, sorry).
BPM can evolve to cover any use-case that is not supported yet. We've done so in the past, and I've personally participated in a couple of those evolutions: one for the JVM native code generator that required execution ability (we added the `allow_executions` option to volumes), and a second for the Postgres daemon, which requires an INT signal (instead of TERM) for a clean exit (that's when we added the `shutdown_signal` option).
So @maxmoehl, I really encourage you to test the `unsafe.privileged: true` option in your haproxy-boshrelease, because it basically disables any namespacing, so that you benefit from the BPM process description in YAML without having any isolation in the way.
In the event we isolate an unsupported use-case, I'm pretty sure we can get BPM to evolve and support it!
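For reference, a sketch of how those options read in a bpm.yml; the process names and paths are invented, and the exact schema placement is from memory of the bpm docs, but the option names are the ones mentioned above:

```yaml
# Sketch only: invented process names and paths; the options shown are the
# ones referenced in this thread (unsafe.privileged, shutdown_signal, allow_executions).
processes:
- name: haproxy
  executable: /var/vcap/packages/haproxy/bin/haproxy
  # Disables bpm's namespacing so the process effectively sees the host,
  # while keeping the YAML process description.
  unsafe:
    privileged: true
- name: postgres
  executable: /var/vcap/packages/postgres/bin/postgres
  # Postgres needs INT (not TERM) for a clean shutdown.
  shutdown_signal: INT
- name: jvm-native
  executable: /var/vcap/jobs/jvm-native/bin/run
  additional_volumes:
  - path: /var/vcap/data/jvm-native/codegen
    writable: true
    # Let the JVM native code generator execute files it writes to this volume.
    allow_executions: true
```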
BPM accommodates most workloads, but some that use cgroups, like containerd, will need extra shims when run on stemcells that use systemd.
I'll give a talk tomorrow at CF Day EU on a pattern I've come up with for running containerd with BPM.
Not all stemcells run systemd; BOSH Lite stemcells are an exception. If we build something where systemd is a hard requirement, we may break BOSH Lite, and that would be sad.
I know @rkoster recently had some success running systemd as part of a docker-cpi stemcell. The same might be applicable to the warden-cpi too.
But if we have an abstraction layer between the bosh release and the process manager, we could always have the bosh-agent support multiple process-manager implementations and build one that still works in BOSH Lite. It might lag behind, but it should be a possibility.