
HAOS 11.4 upgrade failing on HA Yellow with Docker error: "Failed to initialize nft"

Open asayler opened this issue 1 year ago • 11 comments

Describe the issue you are experiencing

I've been unable to update my HA Yellow (Raspberry Pi CM4) from HAOS 11.3 to 11.4. Each time I run the update, the system fails to complete the 11.4 boot and falls back to the 11.3 boot slot. I originally thought this was related to #2870, but after digging into it some more, it seems to be a distinct issue. See the full logs below; in the failed boot logs, Docker fails to start with an "iptables -t nat -N DOCKER: iptables: Failed to initialize nft: Protocol not supported" error:

Jan 15 06:40:49 ha systemd[1]: Starting Docker Application Container Engine...
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.044616804Z" level=info msg="Starting up"
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.045878137Z" level=warning msg="Running experimental build"
Jan 15 06:40:50 ha audit[478]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=478 comm="apparmor_parser"
Jan 15 06:40:50 ha audit[478]: SYSCALL arch=c00000b7 syscall=64 success=yes exit=8369 a0=4 a1=556f2adc80 a2=20b1 a3=1 items=0 ppid=477 pid=478 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="apparmor_parser" exe="/usr/sbin/apparmor_parser" subj=unconfined key=(null)
Jan 15 06:40:50 ha audit: PROCTITLE proctitle=61707061726D6F725F706172736572002D4B72002F6D6E742F646174612F646F636B65722F746D702F646F636B65722D64656661756C7434303733333334353835
Jan 15 06:40:50 ha kernel: audit: type=1400 audit(1705300850.108:15): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=478 comm="apparmor_parser"
Jan 15 06:40:50 ha kernel: audit: type=1300 audit(1705300850.108:15): arch=c00000b7 syscall=64 success=yes exit=8369 a0=4 a1=556f2adc80 a2=20b1 a3=1 items=0 ppid=477 pid=478 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="apparmor_parser" exe="/usr/sbin/apparmor_parser" subj=unconfined key=(null)
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.163582137Z" level=info msg="[graphdriver] trying configured driver: overlay2"
Jan 15 06:40:50 ha systemd[1]: var-lib-docker-overlay2-metacopy\x2dcheck3120273736-merged.mount: Deactivated successfully.
Jan 15 06:40:50 ha systemd[1]: mnt-data-docker-overlay2-metacopy\x2dcheck3120273736-merged.mount: Deactivated successfully.
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.454632970Z" level=info msg="Loading containers: start."
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.532171767Z" level=info msg="unable to detect if iptables supports xlock: 'iptables --wait -L -n': `iptables: Failed to initialize nft: Protocol not supported`" error="exit status 1"
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.704662859Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.706588729Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jan 15 06:40:50 ha dockerd[468]: failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to create NAT chain DOCKER: iptables failed: iptables -t nat -N DOCKER: iptables: Failed to initialize nft: Protocol not supported
Jan 15 06:40:50 ha dockerd[468]:  (exit status 1)
Jan 15 06:40:50 ha systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 06:40:50 ha docker-failure[521]: Docker exited with exit status 1, this might be caused by corrupted key.json.
Jan 15 06:40:50 ha docker-failure[522]: stat: can't stat '/mnt/overlay/etc/docker/key.json': No such file or directory
Jan 15 06:40:50 ha docker-failure[521]: key.json:  bytes
Jan 15 06:40:50 ha docker-failure[521]: /usr/libexec/docker-failure: line 7: can't open /mnt/overlay/etc/docker/key.json: no such file
Jan 15 06:40:50 ha docker-failure[521]: key.json appears to be corrupted, it is not parsable. Removing it.
Jan 15 06:40:50 ha systemd[1]: docker.service: Failed with result 'exit-code'.
Jan 15 06:40:50 ha systemd[1]: Failed to start Docker Application Container Engine.
Jan 15 06:40:50 ha systemd[1]: Dependency failed for HassOS supervisor.
Jan 15 06:40:50 ha systemd[1]: hassos-supervisor.service: Job hassos-supervisor.service/start failed with result 'dependency'.

See full logs below.

What operating system image do you use?

yellow

What version of Home Assistant Operating System is installed?

11.4

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. From a terminal on the HA Yellow running HAOS 11.3, run ha os update
  2. Wait for the update to complete successfully and for the system to reboot
  3. On reboot, watch the serial terminal and observe the boot process failing at the docker startup step with the "Failed to initialize nft: Protocol not supported" error.
  4. Wait for the boot process to fail three times, at which point the system will switch back over to the 11.3 bootslot and boot correctly.
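Once the system has fallen back to 11.3, the failed slot's journal can still be inspected (journalctl's -b -1 selects the previous boot, and the journal here is persisted on the data partition). A small helper for scanning a saved boot log for this specific failure, shown as a sketch (the function name is made up):

```shell
# count_nft_failures: count occurrences of the nft initialization error
# in a boot log read from stdin. Feed it e.g. the output of
# `journalctl -b -1 -u docker.service` or a saved serial capture.
count_nft_failures() {
  grep -c 'Failed to initialize nft'
}
```

A non-zero count in the previous boot's docker.service log indicates the same failure mode as above.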

Anything in the Supervisor logs that might be useful for us?

Failed docker log:

Jan 15 06:40:49 ha systemd[1]: Starting Docker Application Container Engine...
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.044616804Z" level=info msg="Starting up"
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.045878137Z" level=warning msg="Running experimental build"
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.163582137Z" level=info msg="[graphdriver] trying configured driver: overlay2"
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.454632970Z" level=info msg="Loading containers: start."
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.532171767Z" level=info msg="unable to detect if iptables supports xlock: 'iptables --wait -L -n': `iptables: Failed to initialize nft: Protocol not supported`" error="exit status 1"
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.704662859Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Jan 15 06:40:50 ha dockerd[468]: time="2024-01-15T06:40:50.706588729Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jan 15 06:40:50 ha dockerd[468]: failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to create NAT chain DOCKER: iptables failed: iptables -t nat -N DOCKER: iptables: Failed to initialize nft: Protocol not supported
Jan 15 06:40:50 ha dockerd[468]:  (exit status 1)
Jan 15 06:40:50 ha systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 06:40:50 ha docker-failure[521]: Docker exited with exit status 1, this might be caused by corrupted key.json.
Jan 15 06:40:50 ha docker-failure[522]: stat: can't stat '/mnt/overlay/etc/docker/key.json': No such file or directory
Jan 15 06:40:50 ha docker-failure[521]: key.json:  bytes
Jan 15 06:40:50 ha docker-failure[521]: /usr/libexec/docker-failure: line 7: can't open /mnt/overlay/etc/docker/key.json: no such file
Jan 15 06:40:50 ha docker-failure[521]: key.json appears to be corrupted, it is not parsable. Removing it.
Jan 15 06:40:50 ha systemd[1]: docker.service: Failed with result 'exit-code'.
Jan 15 06:40:50 ha systemd[1]: Failed to start Docker Application Container Engine.

Anything in the Host logs that might be useful for us?

I'll attach full boot logs in a comment.

System information

System Information

version core-2023.11.3
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.6
os_name Linux
os_version 6.1.63-haos-raspi
arch aarch64
timezone America/Denver
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 4939
Installed Version 1.33.0
Stage running
Available Repositories 1376
Downloaded Repositories 8
Home Assistant Cloud
logged_in true
subscription_expiration May 22, 2024 at 6:00 PM
relayer_connected true
relayer_region us-east-1
remote_enabled true
remote_connected true
alexa_enabled false
google_enabled true
remote_server us-east-1-7.ui.nabu.casa
certificate_status ready
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 11.3
update_channel stable
supervisor_version supervisor-2023.12.0
agent_version 1.6.0
docker_version 24.0.7
disk_total 457.7 GB
disk_used 15.0 GB
healthy true
supported true
board yellow
supervisor_api ok
version_api ok
installed_addons Let's Encrypt (5.0.9), NGINX Home Assistant SSL proxy (3.7.0), File editor (5.7.0), Terminal & SSH (9.8.1), Z-Wave JS UI (3.1.0), Uptime Kuma (0.12.0), AWNET to HASS (1.0.1), ESPHome (2023.12.5), Silicon Labs Flasher (0.2.0)
Dashboards
dashboards 4
resources 0
views 3
mode storage
Recorder
oldest_recorder_run January 5, 2024 at 1:42 AM
current_recorder_run January 14, 2024 at 11:42 PM
estimated_db_size 1047.00 MiB
database_engine sqlite
database_version 3.41.2

Additional information

No response

asayler avatar Jan 15 '24 07:01 asayler

Full failed boot log: last-boot-full.log Failed boot serial terminal output: ha_boot_11.4_failed.txt

asayler avatar Jan 15 '24 07:01 asayler

Hm, I am running HAOS 11.4 successfully here on two Yellows, so this must be something related to your particular instance/installation.

The best explanation that comes to mind is that the kernel gets loaded from a different partition than what is later mounted as rootfs. The failed log supports this theory: in your case the kernel was built on Dec 4:

[    0.000000] Linux version 6.1.58-haos-raspi (builder@5b1a6501bc4e) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot -g2b699621) 11.4.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Mon Dec  4 15:51:47 UTC 2023

While my 11.4 installation shows a kernel built in January:

[    0.000000] Linux version 6.1.63-haos-raspi (builder@13ed6d6d8021) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot -g2d89a0f9) 11.4.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Tue Jan  9 10:42:51 UTC 2024

Did you install HAOS directly to the NVMe? In that case it could be that you have a stale installation on your eMMC, and the system is now mixing up the two.

I'd recommend taking a full backup and downloading it while your system still comes up. From there, a reinstall using Option 1 documented at https://yellow.home-assistant.io/guides/reinstall-os/ is probably the best choice; it makes sure the eMMC is cleared. If you want your system to boot from NVMe, make sure to press the blue button in step 9.

agners avatar Jan 15 '24 12:01 agners

Yes, this is an nvme-based install. Happy to try a reinstall, but it seems like there's still a bug here if this breaks for anyone using an nvme install. Any further debugging I should do to try to debug this prior to reinstalling?


asayler avatar Jan 15 '24 16:01 asayler

One of my two Yellow installations is an NVMe install, and that one did not break. It is related to the "stale" install on your eMMC, which HAOS doesn't like much. The problem is that we rely on PARTUUID to find the correct root partition, and this is static in every install.

Looking at the logs again, this snippet also really shows that case:

Apr 04 10:55:23 homeassistant kernel: printk: console [ttyAMA2] enabled
Apr 04 10:55:23 homeassistant kernel:  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8
Apr 04 10:55:23 homeassistant kernel: mmc0: new DDR MMC card at address 0001
Apr 04 10:55:23 homeassistant kernel: mmc1: new high speed SDIO card at address 0001
Apr 04 10:55:23 homeassistant kernel: mmcblk0: mmc0:0001 BJTD4R 29.1 GiB 
Apr 04 10:55:23 homeassistant kernel: printk: console [netcon0] enabled
Apr 04 10:55:23 homeassistant kernel:  mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8

Our update system, as well as U-Boot, relies on the partition UUIDs and labels being unique in the system, so this setup is bound to cause issues long term. What we could probably do is check the partitions from the Supervisor side and warn the user if this type of setup is found. :thinking:
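The duplicate-label condition described above can be spotted from a shell. A minimal sketch (the function name is invented; the lsblk columns are standard util-linux):

```shell
# find_dup_partlabels: read "PARTLABEL DEVICE" pairs on stdin and print
# any label that appears on more than one device. On a healthy HAOS
# install, every hassos-* label resolves to exactly one partition.
find_dup_partlabels() {
  awk 'NF == 2 { count[$1]++ } END { for (l in count) if (count[l] > 1) print l }'
}

# On a live system, feed it real data:
#   lsblk -nro PARTLABEL,PATH | find_dup_partlabels
```

Any label it prints is a candidate for the kind of NVMe/eMMC mix-up seen in this issue.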

agners avatar Jan 15 '24 16:01 agners

@asayler if you have serial console access, the output of this command would actually be interesting:

blkid --match-token PARTLABEL="hassos-system0" --output device

Also, you can check which partition gets used as data partition using:

findmnt /mnt/data/

It seems the system boots from NVMe, so if you are certain your data is on NVMe too, you can try to fix the problem "manually" by deleting the eMMC installation

:warning: this might break your system, make a backup before!

sync
blkdiscard /dev/mmcblk0
dd if=/dev/zero of=/dev/mmcblk0 bs=1M count=32
reboot -f
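Before running the wipe, it's worth double-checking programmatically that /mnt/data really lives on the NVMe device. A hedged sketch (the function name is invented; pass it the output of findmnt -no SOURCE /mnt/data):

```shell
# safe_to_wipe_emmc: succeed only when the given data-partition device is
# on NVMe, meaning a wipe of /dev/mmcblk0 will not touch live data.
# This is a sanity check, not a guarantee -- still take a backup first.
safe_to_wipe_emmc() {
  case "$1" in
    /dev/nvme*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage on the device:
#   safe_to_wipe_emmc "$(findmnt -no SOURCE /mnt/data)" && echo "data is on NVMe"
```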

agners avatar Jan 15 '24 16:01 agners

Thanks, @agners. Here's the result of those commands:

# blkid --match-token PARTLABEL="hassos-system0" --output device
/dev/nvme0n1p3
/dev/mmcblk0p3
# blkid --match-token PARTLABEL="hassos-data" --output device
/dev/nvme0n1p8
/dev/mmcblk0p8
# findmnt /mnt/data/
TARGET    SOURCE         FSTYPE OPTIONS
/mnt/data /dev/nvme0n1p8 ext4   rw,relatime,commit=30
# ls -al /dev/disk/by-partlabel/
total 0
drwxr-xr-x    2 root     root           200 Jan 15 07:53 .
drwxr-xr-x    9 root     root           180 Apr  4  2023 ..
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-boot -> ../../nvme0n1p1
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-bootstate -> ../../nvme0n1p6
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-data -> ../../mmcblk0p8
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-kernel0 -> ../../nvme0n1p2
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-kernel1 -> ../../nvme0n1p4
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-overlay -> ../../nvme0n1p7
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-system0 -> ../../nvme0n1p3
lrwxrwxrwx    1 root     root            15 Jan 15 07:53 hassos-system1 -> ../../nvme0n1p5

So it does seem like there are conflicting partition labels on the nvme and mmc. I was pretty sure I followed the directions you note when I set this up initially to force the nvme install on a fresh CM4, but it's possible I completed an install onto the eMMC prior to doing that (this was installed about 10 months ago, so I've forgotten the details). I wonder why the install scripts failed to wipe the eMMC when performing the nvme install? Seems like that was maybe the original issue here.

I'll attempt to wipe the eMMC partitions shortly (although I assume I could also just change their partition labels to avoid the naming conflict) to see if that resolves the issue. If that fails, I'll do a full reinstall.

asayler avatar Jan 15 '24 18:01 asayler

And just to give the full picture, data seems to be on nvme, but the overlay is running off eMMC:

# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
mmcblk0      179:0    0  29.1G  0 disk 
|-mmcblk0p1  179:1    0    32M  0 part /mnt/boot
|-mmcblk0p2  179:2    0    24M  0 part 
|-mmcblk0p3  179:3    0   256M  0 part 
|-mmcblk0p4  179:4    0    24M  0 part 
|-mmcblk0p5  179:5    0   256M  0 part 
|-mmcblk0p6  179:6    0     8M  0 part 
|-mmcblk0p7  179:7    0    96M  0 part /var/lib/systemd
|                                      /var/lib/bluetooth
|                                      /var/lib/NetworkManager
|                                      /etc/systemd/timesyncd.conf
|                                      /etc/hosts
|                                      /etc/hostname
|                                      /etc/NetworkManager/system-connections
|                                      /root/.ssh
|                                      /root/.docker
|                                      /etc/udev/rules.d
|                                      /etc/modules-load.d
|                                      /etc/modprobe.d
|                                      /etc/dropbear
|                                      /mnt/overlay
`-mmcblk0p8  179:8    0  28.4G  0 part 
mmcblk0boot0 179:32   0     4M  1 disk 
mmcblk0boot1 179:64   0     4M  1 disk 
zram0        254:0    0     0B  0 disk 
zram1        254:1    0    32M  0 disk 
zram2        254:2    0    16M  0 disk /tmp
nvme0n1      259:0    0 465.8G  0 disk 
|-nvme0n1p1  259:1    0    32M  0 part 
|-nvme0n1p2  259:2    0    24M  0 part 
|-nvme0n1p3  259:3    0   256M  0 part 
|-nvme0n1p4  259:4    0    24M  0 part 
|-nvme0n1p5  259:5    0   256M  0 part /
|-nvme0n1p6  259:6    0     8M  0 part 
|-nvme0n1p7  259:7    0    96M  0 part 
`-nvme0n1p8  259:8    0 465.1G  0 part /var/log/journal
                                       /var/lib/docker
                                       /mnt/data
# mount | grep mmc
/dev/mmcblk0p1 on /mnt/boot type vfat (rw,relatime,sync,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)
/dev/mmcblk0p7 on /mnt/overlay type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/dropbear type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/modprobe.d type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/modules-load.d type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/udev/rules.d type ext4 (rw,relatime)
/dev/mmcblk0p7 on /root/.docker type ext4 (rw,relatime)
/dev/mmcblk0p7 on /root/.ssh type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/NetworkManager/system-connections type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/hostname type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/hosts type ext4 (rw,relatime)
/dev/mmcblk0p7 on /etc/systemd/timesyncd.conf type ext4 (rw,relatime)
/dev/mmcblk0p7 on /var/lib/NetworkManager type ext4 (rw,relatime)
/dev/mmcblk0p7 on /var/lib/bluetooth type ext4 (rw,relatime)
/dev/mmcblk0p7 on /var/lib/systemd type ext4 (rw,relatime)
# mount | grep nvme
/dev/nvme0n1p5 on / type squashfs (ro,relatime,errors=continue)
/dev/nvme0n1p8 on /mnt/data type ext4 (rw,relatime,commit=30)
/dev/nvme0n1p8 on /var/lib/docker type ext4 (rw,relatime,commit=30)
/dev/nvme0n1p8 on /var/log/journal type ext4 (rw,relatime,commit=30)

asayler avatar Jan 15 '24 18:01 asayler

I'm running HAOS on an Intel NUC. Everything was running well (11.3) until the upgrade to 11.4, after which I could not access homeassistant.local. Viewing the NUC locally revealed a cifs_mount failed w/return code = -101 error, which means the OS starts before the network connection is established. Tried several 11.4 reinstalls to no avail - same issue after each reboot. Note that HAOS 11.4 appears to boot correctly but without network access. I did some research and saw suggestions about disabling IPV6, or modifying config files to add a delay to wait for an established network, but I've never had to worry about that prior to 11.4, and don't want to worry about it now. I'll wait for the HA team's 11.4.1 release which fixes their mistake. I reinstalled 11.3 and restored my last daily backup. It's nice to be back in Home Automation Heaven...

dwgtx avatar Jan 15 '24 22:01 dwgtx

@dwgtx your case seems to be a networking issue on your particular platform with HAOS 11.4. Can you open a new issue with detailed information about your system (model number of your hardware, network card information)?

agners avatar Jan 16 '24 08:01 agners

@asayler

I wonder why the install scripts failed to wipe the eMMC when performing the nvme install? Seems like that was maybe the original issue here.

Yeah the installer indeed should wipe the eMMC: https://github.com/NabuCasa/buildroot-installer/blob/2022.02.x-yellow-installer/rootfs-overlay/usr/bin/haos-flash#L43-L48

I was assuming that you pulled out the NVMe and installed directly (or used the rpiboot method to expose the NVMe as a mass storage device to your computer and flashed it directly, or something similar).

I'll attempt to wipe the eMMC partitions shortly (although I assume I could also just change their partition labels to avoid the naming conflict) to see if that resolves the issue. If that fails, I'll do a full reinstall.

You'd have to change the partition labels of all of them. Also, at boot we use the UUID, so you'd have to change that too. Just removing the whole partition table is really the easy way out here. :sweat_smile:

agners avatar Jan 16 '24 09:01 agners

I went ahead and wiped the partition table on the eMMC storage (the blkdiscard command failed since the device was mounted/busy, but the dd coupled with a reboot worked). The system came back up on 11.4 and now seems to be (correctly) mounting all storage off the NVMe device:

# mount | grep mmc
# mount | grep nvm
/dev/nvme0n1p3 on / type squashfs (ro,relatime,errors=continue)
/dev/nvme0n1p1 on /mnt/boot type vfat (rw,relatime,sync,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)
/dev/nvme0n1p7 on /mnt/overlay type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/dropbear type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/modprobe.d type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/modules-load.d type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/udev/rules.d type ext4 (rw,relatime)
/dev/nvme0n1p7 on /root/.docker type ext4 (rw,relatime)
/dev/nvme0n1p7 on /root/.ssh type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/NetworkManager/system-connections type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/hostname type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/hosts type ext4 (rw,relatime)
/dev/nvme0n1p7 on /etc/systemd/timesyncd.conf type ext4 (rw,relatime)
/dev/nvme0n1p8 on /mnt/data type ext4 (rw,relatime,commit=30)
/dev/nvme0n1p7 on /var/lib/NetworkManager type ext4 (rw,relatime)
/dev/nvme0n1p7 on /var/lib/bluetooth type ext4 (rw,relatime)
/dev/nvme0n1p8 on /var/lib/docker type ext4 (rw,relatime,commit=30)
/dev/nvme0n1p7 on /var/lib/systemd type ext4 (rw,relatime)
/dev/nvme0n1p8 on /var/log/journal type ext4 (rw,relatime,commit=30)

There do still seem to be two mmc boot devices present, not sure if that's expected or not:

NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
mmcblk0      179:0    0  29.1G  0 disk
mmcblk0boot0 179:32   0     4M  1 disk
mmcblk0boot1 179:64   0     4M  1 disk
zram0        254:0    0     0B  0 disk
zram1        254:1    0    32M  0 disk
zram2        254:2    0    16M  0 disk /tmp
nvme0n1      259:0    0 465.8G  0 disk
|-nvme0n1p1  259:1    0    32M  0 part /mnt/boot
|-nvme0n1p2  259:2    0    24M  0 part
|-nvme0n1p3  259:3    0   256M  0 part /
|-nvme0n1p4  259:4    0    24M  0 part
|-nvme0n1p5  259:5    0   256M  0 part
|-nvme0n1p6  259:6    0     8M  0 part
|-nvme0n1p7  259:7    0    96M  0 part /var/lib/systemd
|                                      /var/lib/bluetooth
|                                      /var/lib/NetworkManager
|                                      /etc/systemd/timesyncd.conf
|                                      /etc/hosts
|                                      /etc/hostname
|                                      /etc/NetworkManager/system-connections
|                                      /root/.ssh
|                                      /root/.docker
|                                      /etc/udev/rules.d
|                                      /etc/modules-load.d
|                                      /etc/modprobe.d
|                                      /etc/dropbear
|                                      /mnt/overlay
`-nvme0n1p8  259:8    0 465.1G  0 part /var/log/journal
                                       /var/lib/docker
                                       /mnt/data

But otherwise things now seem to be functioning correctly.

I'll likely still do a full reinstall soon since I want to swap out the CM4 with a higher RAM version and replace the SSD while I'm at it. But I do at least seem to be back to a stable boot state. Not sure if you want to keep this ticket open to add any protections against this edge case going forward, but I appreciate the insight into what was going on here.

asayler avatar Jan 19 '24 02:01 asayler

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 18 '24 05:04 github-actions[bot]

Closing as it is resolved.

agners avatar Apr 18 '24 07:04 agners