
external data disk gets disabled on HAOS update in UEFI (arm64) VM - machine_id found empty

Open claplace opened this issue 1 year ago • 11 comments

Describe the issue you are experiencing

After the recent update to 12.1, the system got stuck during boot. This time I took the time to fetch the journalctl and dmesg outputs.

My external data disk was labelled "hassos-data-dis". I was able to change it to "hassos-data" and reboot successfully.

Looking at existing issues, I think the root cause is that the boot immediately after the operating system update got an empty machine_id value:

homeassistant kernel: Kernel command line: BOOT_IMAGE=(hd0,gpt2)/Image root=PARTUUID=8d3d53e3-6d49-4c38-8349-aff6859e82fd rootwait zram.enabled=1 zram.num_devices=3 systemd.machine_id= fsck.repair=yes systemd.condition-first-boot=true console=tty1 console=ttyS0 rauc.slot=A

This caused the haos-data-disk-detach script to run:

homeassistant systemd[1]: Starting HAOS data disk detach...

and that got the data disk disabled.
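To check for this condition on a running system, the systemd.machine_id= parameter can be pulled out of the kernel command line. This is a minimal sketch; machine_id_from_cmdline is a hypothetical helper name, not part of HAOS:

```shell
#!/bin/sh
# Hypothetical helper: print the value of systemd.machine_id= found in a
# kernel command line string. Returns 1 if the parameter is absent.
machine_id_from_cmdline() {
    # Unquoted $1 is split on whitespace into individual parameters.
    for arg in $1; do
        case "$arg" in
            systemd.machine_id=*)
                printf '%s\n' "${arg#systemd.machine_id=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Shortened sample taken from the command line quoted above.
sample='rootwait zram.enabled=1 systemd.machine_id= fsck.repair=yes'
if id=$(machine_id_from_cmdline "$sample") && [ -z "$id" ]; then
    echo "systemd.machine_id is present but empty"
fi
```

On the affected machine the same function can be fed the live command line with machine_id_from_cmdline "$(cat /proc/cmdline)".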

This is the second time it has happened (since I moved the data to a second disk), and it might be related to the update procedure for UEFI systems. I see the new machine_id in /mnt/boot/efi/boot/grubenv.

journalctl.txt dmesg.txt

What operating system image do you use?

generic-aarch64 (Generic UEFI capable aarch64 systems)

What version of Home Assistant Operating System is installed?

Home Assistant OS 12.1

Did you upgrade the Operating System.

Yes

Steps to reproduce the issue

Started 12.0 -> 12.1 update.

Anything in the Supervisor logs that might be useful for us?

no

Anything in the Host logs that might be useful for us?

no

System information

System Information

version core-2024.3.0
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.2
os_name Linux
os_version 6.6.20-haos
arch aarch64
timezone America/New_York
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 4937
Installed Version 1.34.0
Stage running
Available Repositories 1411
Downloaded Repositories 10
Home Assistant Cloud
logged_in true
subscription_expiration November 27, 2024 at 7:00 PM
relayer_connected true
relayer_region us-east-1
remote_enabled true
remote_connected true
alexa_enabled false
google_enabled true
remote_server us-east-1-4.ui.nabu.casa
certificate_status ready
instance_id d135c865502f446fa7746e274daf1f76
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 12.1
update_channel stable
supervisor_version supervisor-2024.03.0
agent_version 1.6.0
docker_version 24.0.7
disk_total 125.9 GB
disk_used 23.7 GB
healthy true
supported true
board generic-aarch64
supervisor_api ok
version_api ok
installed_addons ZeroTier One (0.18.0), Terminal & SSH (9.10.0), File editor (5.8.0), ESPHome (2024.2.2), Z-Wave JS (0.4.5), Matter Server (5.4.1)
Dashboards
dashboards 1
resources 3
views 8
mode storage
Recorder
oldest_recorder_run March 6, 2024 at 11:33 PM
current_recorder_run March 13, 2024 at 7:01 PM
estimated_db_size 268.23 MiB
database_engine sqlite
database_version 3.44.2

Additional information

No response

claplace avatar Mar 13 '24 23:03 claplace

I suspect that I have the same bug on the RPi5 with SSD, but I can no longer access any logs. In my case, only a reinstallation followed by restoring the backup helped.

chris0607 avatar Mar 14 '24 08:03 chris0607

So this is not specific to "UEFI (arm64) VM" if others see this on Raspberry Pis too, right?

bcutter avatar Apr 02 '24 11:04 bcutter

@claplace I agree with your analysis, it seems that first boot got detected again.

The usual cause is that an internal SD card as well as an external SD card/disk have a full HAOS installation (with the boot partition etc.) on them. The system might then write the boot information to a different disk than the one the boot loader reads from.

Can you run lsblk from the terminal to check what disks and partitions there are?

agners avatar Apr 02 '24 19:04 agners

Sure :)

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0   32G  0 disk 
|-sda1   8:1    0   32M  0 part /mnt/boot
|-sda2   8:2    0   24M  0 part 
|-sda3   8:3    0  256M  0 part /
|-sda4   8:4    0   24M  0 part 
|-sda5   8:5    0  256M  0 part 
|-sda6   8:6    0    8M  0 part 
|-sda7   8:7    0   96M  0 part /var/lib/systemd
|                               /var/lib/bluetooth
|                               /var/lib/NetworkManager
|                               /root/.ssh
|                               /root/.docker
|                               /etc/udev/rules.d
|                               /etc/systemd/timesyncd.conf
|                               /etc/modules-load.d
|                               /etc/modprobe.d
|                               /etc/hosts
|                               /etc/hostname
|                               /etc/dropbear
|                               /etc/NetworkManager/system-connections
|                               /mnt/overlay
`-sda8   8:8    0 31.3G  0 part 
sdb      8:16   0  128G  0 disk 
`-sdb1   8:17   0  128G  0 part /var/log/journal
                                /var/lib/docker
                                /mnt/data
zram0  253:0    0    0B  0 disk 
zram1  253:1    0   32M  0 disk 
zram2  253:2    0   16M  0 disk /tmp

claplace avatar Apr 02 '24 20:04 claplace

also, fdisk -l /dev/sda:

Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BB18D444-28F2-4177-B30F-AB734183BA40

Device       Start      End  Sectors  Size Type
/dev/sda1     2048    67583    65536   32M EFI System
/dev/sda2    67584   116735    49152   24M Linux filesystem
/dev/sda3   116736   641023   524288  256M Linux filesystem
/dev/sda4   641024   690175    49152   24M Linux filesystem
/dev/sda5   690176  1214463   524288  256M Linux filesystem
/dev/sda6  1214464  1230847    16384    8M Linux filesystem
/dev/sda7  1230848  1427455   196608   96M Linux filesystem
/dev/sda8  1427456 67108830 65681375 31.3G Linux filesystem

and for /dev/sdb:

Disk /dev/sdb: 128 GiB, 137438953472 bytes, 268435456 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 97609BB7-B00A-4DBD-911B-DE58B5BC9100

Device     Start       End   Sectors  Size Type
/dev/sdb1   2048 268433407 268431360  128G Linux filesystem

claplace avatar Apr 02 '24 20:04 claplace

Hi, just a note to say I've hit the issue again with the 12.2 update.

fixed with:

e2label /dev/sdb1 hassos-data
reboot
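Before relabelling, it can help to confirm which state the data partition is in. The sketch below only relies on the two labels seen in this thread ("hassos-data" and "hassos-data-dis"); data_label_state is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: classify a filesystem label using the two values
# reported in this thread.
data_label_state() {
    case "$1" in
        hassos-data)     echo active ;;
        hassos-data-dis) echo disabled ;;
        *)               echo unknown ;;
    esac
}

# Usage on the affected system (device path as in this thread; needs root).
# Called with only a device argument, e2label prints the current label:
#   data_label_state "$(e2label /dev/sdb1)"
```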

claplace avatar Apr 12 '24 00:04 claplace

The update to 12.2 worked on the Pi 5 with PCIe SSD. Perhaps it was not the same error then. In my defense, I never got to the logs, I always had to reinstall, in my 3 tests.

chris0607 avatar Apr 12 '24 08:04 chris0607

> Hi, just a note to tell I've hit the issue again with 12.2 update.

I've tried to reproduce this on generic-aarch64, but wasn't able to: for me the machine ID got saved, and on the subsequent boot it was present in the cmdline.

@claplace can you check the logs of hassos-persists?

journalctl -u hassos-persists

And check if the GRUB environment is ok?

grub-editenv /mnt/boot/EFI/BOOT/grubenv list

agners avatar Apr 12 '24 08:04 agners

There is a new HAOS update! Before updating, I ran the commands above. The hassos-persists journal is empty, and as I don't know how to properly ssh into HAOS, here's a screenshot of the current grub env:

Screenshot 2024-05-08 at 20 08 36

claplace avatar May 09 '24 00:05 claplace

I started the update, and went back to print the grub env again and again... and suddenly it went empty:

Screenshot 2024-05-08 at 20 15 19 Screenshot 2024-05-08 at 20 15 55

and sure enough, after reboot, my data disk was disabled.

claplace avatar May 09 '24 00:05 claplace

It happened again. So I went to understand how the HAOS update worked, and here's what I think I have understood.

HAOS is using RAUC, and the update is done from a rauc bundle. e.g. https://github.com/home-assistant/operating-system/releases/download/12.4/haos_generic-aarch64-12.4.raucb for the latest release.

Here's the bundle content:

$ ls -l
total 227656
-rw-r--r-- 1 cyprien cyprien  33554432 Jun 18 09:57 boot.vfat
-rwxr-xr-x 1 cyprien cyprien      4935 Jun 18 09:57 hook
-rw-r--r-- 1 cyprien cyprien  20287488 Jun 18 09:57 kernel.img
-rw-r--r-- 1 cyprien cyprien       521 Jun 18 09:57 manifest.raucm
-rw-r--r-- 1 cyprien cyprien 211886080 Jun 18 09:57 rootfs.img

boot.vfat contains the boot files that will replace the existing ones:

$ find boot
boot
boot/EFI
boot/EFI/BOOT
boot/EFI/BOOT/grub.cfg
boot/EFI/BOOT/bootaa64.efi
boot/EFI/BOOT/grubenv
boot/cmdline.txt
$ grub-editenv boot/EFI/BOOT/grubenv list
$

The grub environment is empty there. Now the bundle hook file contains an install_boot() function that replaces the existing boot files with the new ones, making sure the *.txt files are restored (but I don't see any...).

        # Backup boot config
        cp -f "${BOOT_MNT}"/*.txt "${BOOT_TMP}/" || true

        cp -rf "${BOOT_NEW}"/* "${BOOT_MNT}/"

        # Restore boot config
        cp -f "${BOOT_TMP}"/*.txt "${BOOT_MNT}/" || true

Then I believe rauc rewrites the grubenv with the new boot order, but it does not know about the MACHINE_ID key, which does not appear in https://github.com/rauc/rauc/blob/master/src/bootchooser.c.

After manually downloading the update bundle and installing it with rauc install haos_generic-aarch64-12.4.raucb, I see that the grubenv does not have the MACHINE_ID entry:

image

A simple line added in the hook file could restore it:

image
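The change shown in the screenshot is not reproduced here. As an illustration only, one way to carry MACHINE_ID across the file copy could look like the sketch below. grubenv_get and preserve_machine_id are hypothetical names and are not taken from the actual hook; the sketch assumes the grub-editenv list and set subcommands and the grubenv path used earlier in this thread:

```shell
#!/bin/sh
# Hypothetical helper: read `grub-editenv <file> list` output (KEY=value
# lines) on stdin and print the value of the requested key.
# Returns 1 if the key is not present.
grubenv_get() {
    key=$1
    while IFS= read -r line; do
        case "$line" in
            "$key"=*)
                # Strip everything up to and including the first "=".
                printf '%s\n' "${line#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Hypothetical wrapper: remember MACHINE_ID, run the command that replaces
# the boot files, then write MACHINE_ID back if one was stored before.
preserve_machine_id() {
    grubenv=$1
    shift
    saved=$(grub-editenv "$grubenv" list | grubenv_get MACHINE_ID) || saved=
    "$@"
    if [ -n "$saved" ]; then
        grub-editenv "$grubenv" set "MACHINE_ID=$saved"
    fi
}

# Possible use around the copy step in install_boot() (untested assumption):
#   preserve_machine_id "${BOOT_MNT}/EFI/BOOT/grubenv" \
#       cp -rf "${BOOT_NEW}"/* "${BOOT_MNT}/"
```

Whether the value survives the later boot-order update by rauc would still need to be verified on a real system.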

claplace avatar Jun 28 '24 02:06 claplace