supervised-installer
Bug Report: Supervised system is reported as Unhealthy system - not privileged every time after supervisor upgrade.
OS Version
Debian GNU/Linux 12 (bookworm)
System Information
Linux ha 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
What happened?
Whenever the supervisor is updated, the system is reported as Unhealthy system - Not privileged. Resolving this is easy: all I need to do is restart the supervisor service from the OS prompt. I actually have a cron job that does it once a day.
I've been reading the threads, but I was not able to pinpoint the root cause of this issue.
This bug report has 2 goals:
- If there is a known root cause, one that I can test for, identify, and remove, I would love to know it.
- I would really like a sensor stating whether the system is healthy or not. Then I could automate log collection, probably work around the issue, and pinpoint the exact set of circumstances under which this happens.
The only method I have right now is:
docker logs hassio_supervisor 2>&1 \
| grep -q "System is running in an unhealthy state and needs manual intervention!" \
&& sudo systemctl restart hassio-supervisor.service
It works, but it's very crude. I put it in a cron job, as this has become really annoying.
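A slightly less crude variant (purely a sketch, assuming the `ha` CLI and the hassio-supervisor.service unit from this thread) would be to ask the Supervisor for its own health state instead of grepping container logs; the parsing is split into a helper so it can be checked against canned output:

```shell
#!/bin/sh
# Sketch of a health-gated restart. Assumes the `ha` CLI and the
# hassio-supervisor.service unit from this thread are present.

# Succeeds (exit 0) when an `ha supervisor info` dump reports `healthy: false`.
is_unhealthy() {
  grep -qE '^healthy:[[:space:]]*false'
}

# Cron entry point: restart only when the Supervisor reports itself
# unhealthy, rather than unconditionally once a day.
check_and_restart() {
  if ha supervisor info 2>&1 | is_unhealthy; then
    systemctl restart hassio-supervisor.service
  fi
}
```

Run from root's crontab every few minutes; it is a no-op while the Supervisor still reports `healthy: true`.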
Machine Type
generic-x86-64
Installer output
N/A
Relevant log output
Dec 19 10:05:54 ha hassio_supervisor[621]: 23-12-19 10:05:54 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!
ADR
- [X] I have read through the ADR and have confirmed that my system is compliant with the requirements
- [X] I understand that if my system is found to not be compliant, my issue will be closed immediately without further investigation
Code of Conduct
- [X] I have read and agree to the Home Assistant Code of Conduct
The main issue this is being tracked under is https://github.com/home-assistant/supervisor/issues/4381. It appears to happen only under bookworm, and no root cause has been identified yet. The only thing I can say for certain is that it seems to happen whenever the container is recreated, and it resolves itself with a restart - until the next supervisor update, when the container is recreated yet again.
Thanks for confirming, @ikifar2012.
So, at least at this point, I would ask the maintainers for a system state that reports this unhealthy condition - something I can send an alert on and track.
See home-assistant/supervisor#4839 for my investigation of the concerning things that seem to be happening here.
The "short" version of that is
after a reboot with an existing hassio_supervisor container:
$ uptime
18:44:13 up 1 min, 1 user, load average: 1.65, 0.75, 0.28
$ docker logs hassio_supervisor 2>&1 | tail -2
24-01-25 18:44:04 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
24-01-25 18:44:04 INFO (MainThread) [supervisor.host.manager] Host information reload completed
$ ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: true
supported: true
version: 2023.12.1
version_latest: 2023.12.1
$ docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]
big "but"
docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
xargs pstree -p | grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
xargs sudo ps Z
(or, if you prefer / don't want to install pstree, the single-depth version: docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid)
gives
LABEL PID TTY STAT TIME COMMAND
unconfined 2503 ? Ss 0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
unconfined 2638 ? S 0:00 s6-supervise s6-linux-init-shutdownd
unconfined 2641 ? Ss 0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
unconfined 2666 ? S 0:00 s6-supervise s6rc-fdholder
unconfined 2667 ? S 0:00 s6-supervise s6rc-oneshot-runner
unconfined 2680 ? Ss 0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
unconfined 2722 ? S 0:00 s6-supervise supervisor
unconfined 2724 ? S 0:00 s6-supervise watchdog
unconfined 2726 ? Ss 0:00 bash /usr/bin/bashio ./run watchdog
unconfined 2727 ? Ssl 0:06 python3 -m supervisor
unconfined 7424 ? S 0:00 sleep 300
that is, the apparmor profile is loaded BUT NOT APPLIED to any supervisor process
but after supervisor upgrade
(or any other event which causes recreation of hassio_supervisor container on a running system)
odroidn2:~:# docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
xargs pstree -p | grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
xargs ps Z
LABEL PID TTY STAT TIME COMMAND
hassio-supervisor (enforce) 9035 ? Ss 0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
hassio-supervisor (enforce) 9077 ? S 0:00 s6-supervise s6-linux-init-shutdownd
hassio-supervisor (enforce) 9079 ? Ss 0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
hassio-supervisor (enforce) 9086 ? S 0:00 s6-supervise s6rc-fdholder
hassio-supervisor (enforce) 9087 ? S 0:00 s6-supervise s6rc-oneshot-runner
hassio-supervisor (enforce) 9095 ? Ss 0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
hassio-supervisor (enforce) 9135 ? S 0:00 s6-supervise supervisor
hassio-supervisor (enforce) 9137 ? S 0:00 s6-supervise watchdog
hassio-supervisor (enforce) 9138 ? Ssl 0:03 python3 -m supervisor
hassio-supervisor (enforce) 9142 ? Ss 0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce) 9773 ? S 0:00 sleep 30
odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: false
supported: true
version: 2023.12.1
version_latest: 2023.12.1
odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep 'CRITICAL\|ERROR\|WARNING'
24-01-25 20:07:42 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.rauc] Host has no rauc support. OTA updates have been disabled.
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.agent] No OS-Agent support on the host. Some Host functions have been disabled.
24-01-25 20:07:52 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!
The AppArmor profile is now in effect, and it is not sufficient for udev_monitor_new_from_netlink(3) (specifically the socket(2) call inside it: socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT)).
Thus there appear to be two important issues:
- Why is the apparmor profile as written not sufficient for this udev socket open on Debian 12?
  - (while remaining sufficient on HAOS ... one assumes/hopes)
  - My best guess (which I should test, but haven't yet) is that it's the `deny network raw` on line 8; that, or the policy needs to be applied to the actual pyc in pyudev and/or libudev.so, not just the python interpreter? Or there is a new capability needed?
- Why is the apparmor profile not being applied to the container's processes on first start after boot? (Also no longer, for me, on siblings of ns pid 1 spawned by docker exec.)
  - Is this a docker/containerd bug? A kernel bug? A systemd bug?
  - Or is it the initialization chain of hassio-apparmor being race-y? As I demonstrated in this comment, it does at least seem that containerd will not start the container when the profile is not loaded - but I'm thinking maybe there is a race between allocation of the name and restore of its contents? (And that apparmor modules short-circuit application of a profile if it's found empty at the time?)
- I think I ended up leaving it out of the version posted in there, but it's perhaps worth noting that with the profile applied, pyudev.Context().list_devices() still works fine (that is, I believe: udev_enumerate_new(3) and udev_list_entry_get_next(3)) in the supervisor process itself (with debug/debug_wait),
  so the problem seems to be less about udev writ large than about the (monitor_new_from_)netlink part in particular
  (udevadm monitor in the supervisor container (as well as in the hassio_audio container under docker-default apparmor) still works - but that appears to be by virtue of the Ux vs. ix the policy designates it with).
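If the guess above about a `deny network raw` rule is right, the candidate fix would be an explicit netlink allowance in the hassio-supervisor profile. Purely as an untested sketch of AppArmor's network-rule syntax, not a verified fix:

```
  # Hypothetical addition to the hassio-supervisor profile: permit the raw
  # netlink socket that udev_monitor_new_from_netlink(3) opens for
  # NETLINK_KOBJECT_UEVENT.
  network netlink raw,
```

Whether a family/type allow rule actually overrides the existing deny (AppArmor deny rules take precedence, so the deny itself would presumably need narrowing) is exactly the kind of thing that would need testing.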
speaking to OP's workaround:
For my part, systemctl restart hassio-supervisor et al. are not sufficient (at least not consistently[^1]) to get from "bad" to "good"; only a full system reboot is - but this may have changed at some point since I started encountering the issue.
Conversely, while I was writing https://github.com/home-assistant/supervisor/issues/4839, I could have sworn that at some point a "soft" restart was initially, but not repeatably, sufficient to get from "good" to "bad" (but that could, for example, be something like /run/supervisor/startup-marker causing a recreate instead of a docker stop / docker start).
[^1]: Though one early time I was looking into this, the act of editing config.json or using ha supervisor options to set debug and debug_block, and then bootstrapping my first supervisor vscode environment, did appear to immediately sidestep the failure - launching it to full heisenbug status. I do not know if that suggests an even more subtle timing issue (such as cleanup of the last supervisor's TCP_WAIT-ish socket?) or if I just, somewhere along the way that time, e.g. rebooted the machine without really noting it as part of what I did.
But to stop my thread-spamming: I would be very curious whether other people experiencing this sort of issue can replicate (or countermand) my catching the processes unconstrained.
(Using docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid, the deeper pstree|sed version, or just aa-status/apparmor_status - though with the latter you are looking for the presence/absence of the hassio-supervisor profile in the list of processes.)
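For anyone who wants a quick yes/no rather than eyeballing the ps output, the check above can be wrapped up roughly like this (a sketch reusing the same docker inspect / ps Z commands from this thread; the counting helper is separate so it can be exercised against canned output):

```shell
#!/bin/sh
# Sketch: report whether any process in the hassio_supervisor container is
# running without an AppArmor profile applied.

# Count lines whose LABEL column (first field of `ps Z` output) is
# "unconfined"; the header line starts with LABEL, so it never matches.
count_unconfined() {
  awk '$1 == "unconfined" { n++ } END { print n + 0 }'
}

check_confinement() {
  n=$(docker inspect hassio_supervisor --format='{{.State.Pid}}' \
        | xargs ps Z --ppid \
        | count_unconfined)
  if [ "$n" -gt 0 ]; then
    echo "BAD: $n unconfined supervisor process(es)"
  else
    echo "OK: all supervisor processes confined"
  fi
}
```

Note this only checks direct children of the container's pid 1; swap in the pstree|sed pipeline from above for the full tree.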
I just installed hassio supervised on a fresh debian (bookworm) system and am happy to help debug, but much of this thread is over my head. What I can say with certainty is:
For my part
systemctl restart hassio-supervisor et al. are not sufficient (at least not consistently) to get from "bad" to "good"; only a full system reboot is
This was also my experience. Happy to run any commands and provide their output for corroboration.
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
https://github.com/home-assistant/version/pull/370 fixed https://github.com/home-assistant/supervisor/issues/4381. Is this issue fixed by the PR as well?
I would fully expect so, but you know, it needs to roll out first
For what it's worth, I belatedly applied my suggested workaround of pulling the profile entirely and haven't had an issue since, though technically that's a different thing.
Still awaiting this https://github.com/home-assistant/version/pull/372 to be merged as well... We will see after the next supervisor update