HA Linux OS broken after 8.5 -> 9.0 update

Open ramphex opened this issue 2 years ago • 14 comments

Describe the issue you are experiencing

After starting the update in the GUI from 8.5 to 9.0, the server never booted back up.

After accessing the VM console in Proxmox, I'm presented with an UEFI shell (screen shot below)

Screen Shot 2022-09-16 at 15 50 11

No other configuration changes were made. Just launched the update and the system broke.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

8.5

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Launched 8.5 -> 9.0 update
  2. Fail

Anything in the Supervisor logs that might be useful for us?

No access to Supervisor

Anything in the Host logs that might be useful for us?

No access to HA settings

System Health information

No response

Additional information

Screen Shot 2022-09-16 at 15 52 33 Screen Shot 2022-09-16 at 15 52 57

Secure Boot is Disabled.

ramphex avatar Sep 16 '22 19:09 ramphex

Are you indeed using the generic-x86-64 image or the ova?

Hm, it seems that the boot partition got corrupted somehow. Can you download an Ubuntu Live image and boot into it? I wonder if the files on the first partition are intact. This is from a working ova installation:

EFI
EFI/BOOT
EFI/BOOT/grubenv-B
EFI/BOOT/grubenv-A
EFI/BOOT/bootx64.efi
EFI/BOOT/grub.cfg
EFI/BOOT/grubenv
cmdline.txt

You can restore the boot partition with a process along these lines (make sure to create a snapshot first!): https://github.com/home-assistant/operating-system/issues/1913#issuecomment-1128642946
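
A minimal sketch of that file check from the live system, assuming the first partition of the HA disk is already mounted read-only at a hypothetical mount point such as `/mnt/haboot`. The expected paths are the ones from the listing above; the function name and mount point are illustrative only. Names are compared case-insensitively because the partition is FAT.

```python
import os

# Paths taken from the listing of a working ova installation above.
EXPECTED = [
    "EFI/BOOT/grubenv-B",
    "EFI/BOOT/grubenv-A",
    "EFI/BOOT/bootx64.efi",
    "EFI/BOOT/grub.cfg",
    "EFI/BOOT/grubenv",
    "cmdline.txt",
]


def missing_boot_files(mount_point, expected=EXPECTED):
    """Return the expected files that are absent under mount_point."""
    present = set()
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), mount_point)
            # Normalize separators and case before comparing.
            present.add(rel.replace(os.sep, "/").lower())
    return [path for path in expected if path.lower() not in present]


if __name__ == "__main__":
    # Hypothetical mount point; adjust to wherever the partition is mounted.
    print(missing_boot_files("/mnt/haboot"))
```

An empty list would mean every file from the reference listing is present; it says nothing about whether the file contents are intact.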

agners avatar Sep 16 '22 20:09 agners

FWIW, I recommend using a SCSI hard disk over S-ATA, as it usually has better virtualization support. That said, just using S-ATA should not lead to such corruption 😢

agners avatar Sep 16 '22 20:09 agners

We had similar reports in the past, e.g. #1125. However, despite checking that everything is getting unmounted correctly and even syncing the FAT partition, we do get these reports from time to time. Did the update run through without a hiccup? Did you have to force a reboot or similar, or did it reboot on its own?

agners avatar Sep 16 '22 20:09 agners

> We had similar reports in the past, e.g. #1125. However, despite checking that everything is getting unmounted correctly and even syncing the FAT partition, we do get these reports from time to time. Did the update run through without a hiccup? Did you have to force a reboot or similar, or did it reboot on its own?

The update was going through, then the "reconnecting" prompt came up, and the page never reloaded. Upon inspecting the console in Proxmox, I was presented with the UEFI Shell. I'm assuming it rebooted on its own. I have not been able to get out of that shell, nor am I familiar with how to navigate it. I checked the boot settings as per some other threads to make sure Secure Boot is indeed disabled.

ramphex avatar Sep 16 '22 20:09 ramphex

> Are you indeed using the generic-x86-64 image or the ova?
>
> Hm, it seems that the boot partition got corrupted somehow. Can you download an Ubuntu Live image and boot into it? I wonder if the files on the first partition are intact. This is from a working ova installation:
>
> EFI
> EFI/BOOT
> EFI/BOOT/grubenv-B
> EFI/BOOT/grubenv-A
> EFI/BOOT/bootx64.efi
> EFI/BOOT/grub.cfg
> EFI/BOOT/grubenv
> cmdline.txt
>
> You can restore the boot partition with a process along these lines (make sure to create a snapshot first!): #1913 (comment)

It might be OVA. I installed this a while ago and it has been running and upgrading through versions without any issues. First time any issue occurred.

I also have other VMs running on the same server. How would I access this VM's EFI partition to verify the file content?

ramphex avatar Sep 16 '22 20:09 ramphex

> How would I access this VM's EFI partition to verify the file content?

You need to download a live image, e.g. Ubuntu from here, and attach that to your virtual machine. Then boot from that image. From the Ubuntu desktop you then can browse the HA disk. The first partition on the HA disk is the FAT boot partition.

agners avatar Sep 16 '22 20:09 agners

Screen Shot 2022-09-16 at 16 46 40

The EFI partition doesn't seem to be there. Just the main disk

ramphex avatar Sep 16 '22 20:09 ramphex

> The EFI partition doesn't seem to be there. Just the main disk

Uff, that means the partition table got lost. Can you run fdisk /dev/sda? Maybe it can recover the partition table from the backup GPT.
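
Whether a backup GPT header still exists can also be checked by reading the disk (or an image copy of it) directly. A minimal sketch, assuming 512-byte logical sectors and read access to the device; the function name is hypothetical. It relies only on the GPT layout: the primary header sits at LBA 1, the backup header in the last sector, and both begin with the signature `EFI PART`.

```python
import os

SECTOR = 512  # assumption: 512-byte logical sectors
GPT_SIGNATURE = b"EFI PART"


def gpt_headers_present(path, sector=SECTOR):
    """Return (primary_ok, backup_ok) for a disk device or image file."""
    with open(path, "rb") as disk:
        size = disk.seek(0, os.SEEK_END)  # total size in bytes
        if size < 2 * sector:
            return False, False
        disk.seek(sector)                 # primary header: LBA 1
        primary = disk.read(8) == GPT_SIGNATURE
        disk.seek(size - sector)          # backup header: last LBA
        backup = disk.read(8) == GPT_SIGNATURE
    return primary, backup
```

Called as `gpt_headers_present("/dev/sda")`, a result of `(False, True)` would suggest the primary header is gone while the backup survives, which is the situation where recovery from the backup GPT is plausible. The check only looks at the signature and does not validate the header checksums.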

agners avatar Sep 16 '22 20:09 agners

Screen Shot 2022-09-16 at 17 02 42

ramphex avatar Sep 16 '22 21:09 ramphex

Doesn't seem to be the case then 😢

agners avatar Sep 16 '22 21:09 agners

Do you have a recent backup? If so, starting over is probably the best solution. If not, you can try the approach documented here https://github.com/home-assistant/operating-system/issues/1913#issuecomment-1128642946 (using the vdi from HAOS 9.0)

If you can share the current state of the disk image I'd be interested to analyze it a bit more.

agners avatar Sep 16 '22 21:09 agners

So what do I do now? Is there any way to recover the backup from it? Why would the updater break the partition table?

Edit: I had a backup created in Home Assistant.

Edit2: Could you please elaborate on how I could share the current state of the disk image?

ramphex avatar Sep 16 '22 21:09 ramphex

Screen Shot 2022-09-16 at 17 54 23

Tried making a backup of the VM in Proxmox :-\

ramphex avatar Sep 16 '22 21:09 ramphex

> Do you have a recent backup? If so, starting over is probably the best solution. If not, you can try the approach documented here #1913 (comment) (using the vdi from HAOS 9.0)
>
> If you can share the current state of the disk image I'd be interested to analyze it a bit more.

The solution from that post worked. Is there anything else I should do now to ensure the utmost stability? I'm not sure how much this fix has affected the underlying system operation; it seems pretty straightforward though.

ramphex avatar Sep 16 '22 22:09 ramphex

> Is there anything else I should do now to ensure the utmost stability?

Unfortunately I don't really know. I am not exactly sure where these corruptions come from.

agners avatar Oct 03 '22 19:10 agners

Happened to me going from 9.0 -> 9.2 on Proxmox.

Restoring my VM backup to the existing VM didn't work, nor did creating a new VM work.

I had to follow the steps posted above in Linux to restore the boot partition.

thedead avatar Oct 18 '22 02:10 thedead

I upgraded an 8.5 instance (also running on Proxmox) to 9.2 and did not experience this.

My VM settings: image image

ioctl2 avatar Oct 18 '22 03:10 ioctl2

I recently got this as well: I upgraded to 9.2 and the VM would not boot. The Proxmox UEFI could not see the drive as a bootable disk. Running fdisk from Proxmox, it recognized that the MBR was corrupt, and I was able to write it back. Then it started fine.

I have had several other VMs on that machine for years, and none of them managed to corrupt the MBR. I think this is specific to HAOS.
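
For anyone wanting to check sector 0 before letting fdisk rewrite it, here is a minimal sketch, assuming the disk is GPT-partitioned with a protective MBR and that the device or an image copy is readable; the function name is hypothetical. It tests two facts of the MBR layout: the boot signature `0x55 0xAA` at bytes 510 and 511, and a partition entry of type `0xEE` among the four entries starting at byte 446.

```python
def protective_mbr_status(path):
    """Inspect sector 0 of a disk device or image file."""
    with open(path, "rb") as disk:
        sector0 = disk.read(512)
    if len(sector0) < 512:
        return {"boot_signature": False, "protective_entry": False}
    # Boot signature occupies the last two bytes of the sector.
    boot_signature = sector0[510:512] == b"\x55\xaa"
    # Four 16-byte partition entries start at offset 446;
    # the partition type is the byte at offset 4 of each entry.
    types = [sector0[446 + 16 * i + 4] for i in range(4)]
    return {"boot_signature": boot_signature, "protective_entry": 0xEE in types}
```

If both values come back `False` on a disk that used to boot, that would be consistent with the overwritten MBR described above, though it does not show what overwrote it.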

brainrecall avatar Jan 02 '23 17:01 brainrecall

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 02 '23 18:04 github-actions[bot]

Happened to me again today

thedead avatar Apr 11 '23 14:04 thedead