FTL
Add supervised mode
By submitting this pull request, I confirm the following:
- [X] I have read and understood the contributors guide.
- [X] I have checked that another pull request for this purpose does not exist.
- [X] I have considered, and confirmed that this submission will be valuable to others.
- [X] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
- [X] I give this submission freely, and claim no ownership to its content.
How familiar are you with the codebase?:
10
Add a supervision mode. When FTL runs as a supervisor, it starts a child process that is automatically restarted after any abnormal termination, including crashes and uncatchable signals such as SIGKILL.
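For readers unfamiliar with the pattern, a supervisor of this kind can be sketched with POSIX `fork()`/`waitpid()`. This is not FTL's code: `supervise()` and `is_abnormal_exit()` are hypothetical names, the child's work is passed in as a callback instead of running a full FTL instance, and the `restart_delay` and `max_restarts` parameters are assumptions added for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: treat a signal-induced death (SIGSEGV, SIGKILL, ...)
 * or a non-zero exit code as an abnormal termination. */
static bool is_abnormal_exit(int status)
{
    if (WIFSIGNALED(status))
        return true;
    if (WIFEXITED(status))
        return WEXITSTATUS(status) != 0;
    return false;
}

/* Hypothetical supervisor loop: fork a child running child_main(), wait for
 * it and restart it after restart_delay seconds whenever it terminates
 * abnormally. A negative max_restarts means "restart forever".
 * Returns the number of restarts performed, or -1 on fork/waitpid failure. */
static int supervise(int (*child_main)(void), unsigned restart_delay,
                     int max_restarts)
{
    int restarts = 0;
    for (;;) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0)
            _exit(child_main());   /* child: do the real work */

        int status = 0;
        pid_t rc;
        do {                       /* retry if interrupted by a signal */
            rc = waitpid(pid, &status, 0);
        } while (rc < 0 && errno == EINTR);
        if (rc < 0) {
            perror("waitpid");
            return -1;
        }

        if (!is_abnormal_exit(status))
            return restarts;       /* clean shutdown: stop supervising */
        if (max_restarts >= 0 && restarts >= max_restarts)
            return restarts;       /* give up */

        restarts++;
        fprintf(stderr, "child %ld terminated abnormally, restarting in %us\n",
                (long)pid, restart_delay);
        sleep(restart_delay);      /* avoid spinning on immediate crashes */
    }
}
```

A call such as `supervise(run_daemon, 1, -1)`, with `run_daemon` being a hypothetical entry point, would restart the child indefinitely with a 1-second pause, while a clean `exit(0)` from the child ends supervision. Because the parent only waits on its child, even SIGKILL delivered to the child is observed through `WIFSIGNALED`.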
This is also the first step toward supporting updates via the web frontend, as it allows FTL to restart itself without having to rely on a third party.
Great idea!
What are the child processes (components) the supervisor can start individually? Can it e.g. restart the webserver without restarting dnsmasq?
@yubiuser It is one (full) FTL process that acts as a supervisor, starting another (full) instance of FTL. Launching individual parts of FTL as separate components is neither planned nor practically doable. This PR implements something like systemd's restart-on-failure (with somewhat more fine-grained control).
Moving the termination into the tests (to check whether the supervisor behaves as expected) revealed a crash that happens only on musl. On our "normal" builds, canceling a non-existent thread is a simple no-op, whereas on musl it leads to a segmentation fault. The fix is to avoid canceling non-existent threads (3c38420/src/daemon.c#R234-R237).
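The general technique behind such a fix can be sketched as remembering whether a thread was actually started and only then calling `pthread_cancel()`. This is an illustrative sketch, not the code at the referenced lines in `daemon.c`; `struct tracked_thread`, `tracked_start()` and `tracked_stop()` are hypothetical names. Portable code should not rely on `pthread_cancel()` behaving gracefully when given a thread ID that does not refer to a live thread, which matches the glibc/musl difference described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical wrapper: records whether the thread was successfully created,
 * so an uninitialised or already-joined pthread_t is never passed to
 * pthread_cancel(). */
struct tracked_thread {
    pthread_t id;
    bool running;
};

/* Start the thread and remember whether pthread_create() succeeded. */
static int tracked_start(struct tracked_thread *t, void *(*fn)(void *),
                         void *arg)
{
    int rc = pthread_create(&t->id, NULL, fn, arg);
    t->running = (rc == 0);
    return rc;
}

/* Cancel and join the thread only if it exists; returns true if it did. */
static bool tracked_stop(struct tracked_thread *t)
{
    if (!t->running)
        return false;          /* nothing to cancel: a no-op on every libc */
    pthread_cancel(t->id);
    pthread_join(t->id, NULL); /* reap it so the ID is not reused by mistake */
    t->running = false;
    return true;
}
```

Clearing `running` after the join also makes a second `tracked_stop()` harmless, which matters when several shutdown paths may try to stop the same thread.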
Is there a mechanism that prevents an endless restart loop? I cannot see one in the code currently. Would a restart limit, optionally adjustable via a CLI option, make sense?
There is a forced delay of 1 second between restarts to avoid CPU spinning from endless restarts in a very short time. Other than that, there is no counter of recently failed attempts. Is there such a limit in systemd and its Restart=on-failure feature? Even if it were easy, I would still like to avoid reinventing the wheel and use a proper systemd unit with said Restart=on-failure instead of this PR.
https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=
My recollection is that, by default, systemd will only try 5 times and then log an error to the journal saying something like "retried too many times" or "tried to restart too quickly". The linked page discusses these limits as well as alternatives to 'on-failure'.
The related (re)start limits can be set in the [Unit] section: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#StartLimitIntervalSec=interval
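The PR itself only enforces the 1-second delay. If a restart limit like the one asked about earlier were added to the supervisor, it could be a small sliding-window counter in the spirit of systemd's `StartLimitIntervalSec=` and `StartLimitBurst=`: allow at most `burst` restarts within any `interval` seconds. The sketch is hypothetical (`struct restart_limiter`, `limiter_allow()` and `MAX_BURST` are not part of FTL) and only illustrates the technique.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_BURST 16           /* hypothetical upper bound for the ring buffer */

/* Hypothetical limiter: at most `burst` restarts within `interval` seconds. */
struct restart_limiter {
    unsigned burst;            /* max restarts per window, 1..MAX_BURST */
    time_t interval;           /* window length in seconds */
    time_t stamps[MAX_BURST];  /* ring buffer of recent restart times */
    size_t count;              /* number of valid entries in stamps */
    size_t next;               /* slot to overwrite next (oldest once full) */
};

/* Returns true and records the restart if one at time `now` is allowed,
 * false if the limit has been reached (or the configuration is invalid). */
static bool limiter_allow(struct restart_limiter *l, time_t now)
{
    if (l->burst == 0 || l->burst > MAX_BURST)
        return false;

    /* Count recorded restarts that are still inside the window. */
    size_t recent = 0;
    for (size_t i = 0; i < l->count; i++)
        if (now - l->stamps[i] < l->interval)
            recent++;
    if (recent >= l->burst)
        return false;

    /* Record this restart, overwriting the oldest entry when full. */
    l->stamps[l->next] = now;
    l->next = (l->next + 1) % l->burst;
    if (l->count < l->burst)
        l->count++;
    return true;
}
```

The supervisor would call `limiter_allow()` before each restart and stop (or log an error) once it returns false. Feeding it `time(NULL)` is the simplest choice; a monotonic source such as `clock_gettime(CLOCK_MONOTONIC, ...)` would be more robust against wall-clock jumps.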
I see, I need to start doing the systemd integration 😅. I'm sorry for not having done what I promised so far; my new full-time job has dramatically reduced my spare time.
I plan to make another systemd attempt sometime next week. Maybe the busy two of us will manage to do it together 🙂
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Superseded by native systemd unit ( https://github.com/pi-hole/pi-hole/pull/4924 )