dm-verity: How to fix incorrect hash?

Open paulmenzel opened this issue 4 years ago • 3 comments

Issue Report

Bug

Container Linux Version

Unfortunately, there are no instructions on how to determine the version from the CoreOS rescue system when USR-A/B cannot be mounted. /etc/os-release in the rescue system shows the dracut details.

Environment

Bare metal server from Hetzner.

Expected Behavior

The system should boot.

Actual Behavior

The system does not boot.

Reproduction Steps

A faulty disk was replaced, and the following commands were used to copy the first partitions to the new disk.

# Copy boot partitions
dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=512 count=270335
# Update partition types …
sfdisk --part-type /dev/nvme1n1 1 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
sfdisk --part-type /dev/nvme1n1 2 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
sfdisk --part-type /dev/nvme1n1 6 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6

The partition layout is shown below, so the values for dd should be correct.

$ cgpt show /dev/nvme0n1
       start        size    part  contents
           0           1          PMBR
           1           1          Pri GPT header
           2          32          Pri GPT table
        4096      262144       1  Label: "EFI-SYSTEM"
                                  Type: EFI System Partition
                                  UUID: 36757B18-F990-4FD6-847C-CF8FDC4F97F2
      266240        4096       2  Label: "BIOS-BOOT"
                                  Type: 21686148-6449-6E6F-744E-656564454649
                                  UUID: 679BD591-1D9E-4896-881B-FDDEA79EA6C6
      270336     2097152       3  Label: "USR-A"
                                  Type: 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
                                  UUID: 7130C94A-213A-4E5A-8E26-6CCE9662F132
     2367488     2097152       4  Label: "USR-B"
                                  Type: 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
                                  UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A57C
     4464640      262144       6  Label: "OEM"
                                  Type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
                                  UUID: 6A2925C8-FFA8-455B-A72C-15470412A3BB
     4726784      131072       7  Label: "OEM-CONFIG"
                                  Type: C95DC21A-DF0E-4340-8D7B-26CBFA9A03E0
                                  UUID: 7B524685-0362-4B8B-8B2F-A1F58C06ABFF
     4857856     4427776       9  Label: "ROOT"
                                  Type: 3884DD41-8582-4404-B9A8-E9B84F2DF50E
                                  UUID: 7704DCC6-6B62-4B14-9BD3-73148C9A0AC4
     9285632   990929551      10  Label: "raid.1.1"
                                  Type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
                                  UUID: EDB7BF9D-16E0-49B1-9277-08600D574A9B
  1000215183          32          Sec GPT table
  1000215215           1          Sec GPT header
$ cgpt show /dev/nvme1n1
       start        size    part  contents
           0           1          PMBR
           1           1          Pri GPT header
           2          32          Pri GPT table
        4096      262144       1  Label: "EFI-SYSTEM"
                                  Type: 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
                                  UUID: 36757B18-F990-4FD6-847C-CF8FDC4F97F2
      266240        4096       2  Label: "BIOS-BOOT"
                                  Type: 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
                                  UUID: 679BD591-1D9E-4896-881B-FDDEA79EA6C6
      270336     2097152       3  Label: "USR-A"
                                  Type: 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
                                  UUID: 7130C94A-213A-4E5A-8E26-6CCE9662F132
     2367488     2097152       4  Label: "USR-B"
                                  Type: 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
                                  UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A57C
     4464640      262144       6  Label: "OEM"
                                  Type: 5DFBF5F4-2848-4BAC-AA5E-0D9A20B745A6
                                  UUID: 6A2925C8-FFA8-455B-A72C-15470412A3BB
     4726784      131072       7  Label: "OEM-CONFIG"
                                  Type: C95DC21A-DF0E-4340-8D7B-26CBFA9A03E0
                                  UUID: 7B524685-0362-4B8B-8B2F-A1F58C06ABFF
     4857856     4427776       9  Label: "ROOT"
                                  Type: 3884DD41-8582-4404-B9A8-E9B84F2DF50E
                                  UUID: 7704DCC6-6B62-4B14-9BD3-73148C9A0AC4
     9285632   990929551      10  Label: "raid.1.1"
                                  Type: 0FC63DAF-8483-4772-8E79-3D69D8477DE4
                                  UUID: EDB7BF9D-16E0-49B1-9277-08600D574A9B
  1000215183          32          Sec GPT table
  1000215215           1          Sec GPT header
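
The dd count can be checked against the table with a little arithmetic. The following is a minimal sketch, assuming 512-byte sectors (as implied by `bs=512`) and assuming the intent was to copy everything in front of USR-A. The helper names `last_sector_copied` and `partition_end` are made up for illustration and are not part of cgpt or dd.

```shell
#!/bin/sh
# Hypothetical helpers (not part of cgpt or dd) for checking a dd count.

# dd reading from sector 0 with count=N copies sectors 0 .. N-1.
last_sector_copied() {
    echo $(( $1 - 1 ))
}

# First sector after a partition = start + size.
partition_end() {
    echo $(( $1 + $2 ))
}

# BIOS-BOOT start and size, taken from the cgpt listing above.
bios_boot_end=$(partition_end 266240 4096)
echo "first sector after BIOS-BOOT: $bios_boot_end"
echo "last sector copied with count=270335: $(last_sector_copied 270335)"
```

By this arithmetic, BIOS-BOOT occupies sectors 266240 through 270335, while `count=270335` stops at sector 270334, one sector short. Whether that single sector has anything to do with the verity failure is not established in this thread.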

After restarting the system, the boot fails because a verity hash mismatch causes read errors.

device-mapper: verity: metadata block XXX is corrupted

After removing the verity parameters from the Linux kernel command line in GRUB, the mount unit hangs (as expected).

The system should give more useful feedback, even if it is only that CoreOS needs to be reinstalled from scratch because there is no way to fix the problem. (That is what has to be done now, so further debugging or log gathering will prove difficult.)

paulmenzel, Jul 29 '19 22:07

CL doesn't support having multiple USR-A / USR-B / EFI-SYSTEM / OEM partitions. I suspect that on update, the kernel in one EFI-SYSTEM partition is getting updated while the USR-A/B on the other disk is getting updated. The kernel has the verity hash but is checking against the wrong USR-A/B.

You might be able to salvage it by deleting everything but raid.1.1 on one of the disks and seeing if the auto-rollback works. If it doesn't, you'll probably need to reinstall. Basically, make sure you have only one of each of the named partitions.
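
One way to spot duplicated named partitions is to list partition labels across all disks and print those that occur more than once. This is a rough sketch, not a CL tool: `duplicate_labels` is a hypothetical helper, and the sample input is modelled on the two cgpt listings in the report. On a live system, `lsblk -rno NAME,PARTLABEL` should produce input of the same shape.

```shell
#!/bin/sh
# Hypothetical helper: reads "NAME PARTLABEL" pairs on stdin (one per line)
# and prints every label that appears on more than one partition.
# Lines without a label (for example whole-disk entries) are skipped.
duplicate_labels() {
    awk 'NF >= 2 { print $2 }' | sort | uniq -d
}

# Sample input modelled on the cgpt listings in this report (abbreviated).
sample='nvme0n1p1 EFI-SYSTEM
nvme0n1p3 USR-A
nvme0n1p9 ROOT
nvme0n1p10 raid.1.1
nvme1n1p1 EFI-SYSTEM
nvme1n1p3 USR-A
nvme1n1p9 ROOT
nvme1n1p10 raid.1.1'

printf '%s\n' "$sample" | duplicate_labels
```

A duplicated raid.1.1 is expected here, since both disks carry a RAID member; the labels to worry about are the others.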

ajeddeloh, Jul 29 '19 22:07

What is 1.1 in “raid.1.1”?

The partitions were copied with dd, so they matched exactly, and there was no update in between. Does that make sense in any way?

Reinstallation was the only option.

paulmenzel, Jul 30 '19 15:07

> Reinstallation was the only option.

Just to clarify on this (I work with @paulmenzel): a reinstallation only worked after we wiped both disks with wipefs. In the end, we also skipped copying the partitions.

We first tried to reinstall as normal and copied the partition table and contents with sgdisk and dd as follows:

sgdisk /dev/nvme0n1 -R /dev/nvme1n1 && \
  sgdisk -G /dev/nvme1n1

dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=512 count=4427776 # (first 9 partitions)

While the server then booted up, restarting it seemed to trigger the same verity error.
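
The reach of that dd command can be checked against the partition table. The sketch below assumes the reinstalled disk had the same layout as the cgpt listing in the original report, which this thread does not confirm. `coverage` is a hypothetical helper written for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: given a dd count (copying from sector 0) and a
# partition's start and size in sectors, report how much of it was copied.
coverage() {
    count=$1; start=$2; size=$3
    end=$(( start + size ))          # first sector after the partition
    if [ "$count" -ge "$end" ]; then
        echo full
    elif [ "$count" -le "$start" ]; then
        echo none
    else
        echo partial
    fi
}

# Starts and sizes taken from the cgpt listing earlier in this report.
echo "USR-A: $(coverage 4427776 270336 2097152)"
echo "USR-B: $(coverage 4427776 2367488 2097152)"
echo "ROOT:  $(coverage 4427776 4857856 4427776)"
echo "count needed to reach the end of ROOT: $(( 4857856 + 4427776 ))"
```

Under that assumption, 4427776 is the size of ROOT rather than its end offset, so the copy would stop partway through USR-B and never reach ROOT. A partially copied USR partition is a plausible source of verity errors, but that remains a guess.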

douglasward, Jul 30 '19 15:07