RockPi 4c boot loop after talos upgrade

Open cvandesande opened this issue 2 years ago • 7 comments

Bug Report

Description

I believe this has been happening since 1.0 or 1.1. Every OS upgrade results in a boot loop, and I need to dd the sdcard to bring the node back.

The upgrade itself completes without error and installs GRUB to /dev/mmcblk1.
After the reboot, the node never comes back online. When I connect an HDMI cable, I see the following error: * specified install disk does not exist: "/dev/mmcblk1"

Logs

This is the upgrade log

blackrock: kern:  notice: [2022-07-28T12:33:27.956345828Z]: XFS (mmcblk1p3): Unmounting Filesystem
blackrock: user: warning: [2022-07-28T12:33:27.969568828Z]: 2022/07/28 12:33:54 preserved contents of "BOOT": 62079139 bytes
blackrock: user: warning: [2022-07-28T12:33:28.020093828Z]: 2022/07/28 12:33:54 preserved contents of "META": 1055 bytes
blackrock: user: warning: [2022-07-28T12:33:32.902606828Z]: 2022/07/28 12:33:59 preserved contents of "STATE": 135072 bytes
blackrock: user: warning: [2022-07-28T12:33:32.903265828Z]: 2022/07/28 12:33:59 resetting partition table on /dev/mmcblk1
blackrock: user: warning: [2022-07-28T12:33:32.919100828Z]: 2022/07/28 12:33:59 partitioning /dev/mmcblk1 - EFI "105 MB"
blackrock: user: warning: [2022-07-28T12:33:32.919729828Z]: 2022/07/28 12:33:59 created /dev/mmcblk1p1 (EFI) size 204800 blocks
blackrock: user: warning: [2022-07-28T12:33:32.920389828Z]: 2022/07/28 12:33:59 partitioning /dev/mmcblk1 - BIOS "1.0 MB"
blackrock: user: warning: [2022-07-28T12:33:32.921201828Z]: 2022/07/28 12:33:59 created /dev/mmcblk1p2 (BIOS) size 2048 blocks
blackrock: user: warning: [2022-07-28T12:33:32.921960828Z]: 2022/07/28 12:33:59 partitioning /dev/mmcblk1 - BOOT "1.0 GB"
blackrock: user: warning: [2022-07-28T12:33:32.922663828Z]: 2022/07/28 12:33:59 created /dev/mmcblk1p3 (BOOT) size 2048000 blocks
blackrock: user: warning: [2022-07-28T12:33:32.923431828Z]: 2022/07/28 12:33:59 partitioning /dev/mmcblk1 - META "1.0 MB"
blackrock: user: warning: [2022-07-28T12:33:32.924135828Z]: 2022/07/28 12:33:59 created /dev/mmcblk1p4 (META) size 2048 blocks
blackrock: user: warning: [2022-07-28T12:33:32.924905828Z]: 2022/07/28 12:33:59 partitioning /dev/mmcblk1 - STATE "105 MB"
blackrock: user: warning: [2022-07-28T12:33:32.925999828Z]: 2022/07/28 12:33:59 created /dev/mmcblk1p5 (STATE) size 204800 blocks
blackrock: user: warning: [2022-07-28T12:33:32.926714828Z]: 2022/07/28 12:33:59 partitioning /dev/mmcblk1 - EPHEMERAL "0 B"
blackrock: user: warning: [2022-07-28T12:33:32.927376828Z]: 2022/07/28 12:33:59 created /dev/mmcblk1p6 (EPHEMERAL) size 122249216 blocks
blackrock: user: warning: [2022-07-28T12:33:32.939396828Z]: 2022/07/28 12:33:59 formatting the partition "/dev/mmcblk1p1" as "vfat" with label "EFI"
blackrock: user: warning: [2022-07-28T12:33:33.328981828Z]: 2022/07/28 12:33:59 zeroing out "/dev/mmcblk1p2"
blackrock: user: warning: [2022-07-28T12:33:33.388437828Z]: 2022/07/28 12:33:59 formatting the partition "/dev/mmcblk1p3" as "xfs" with label "BOOT"
blackrock: user: warning: [2022-07-28T12:33:36.718076828Z]: 2022/07/28 12:34:02 zeroing out "/dev/mmcblk1p4"
blackrock: user: warning: [2022-07-28T12:33:36.775750828Z]: 2022/07/28 12:34:03 zeroing out "/dev/mmcblk1p5"
blackrock: user: warning: [2022-07-28T12:33:41.891533828Z]: 2022/07/28 12:34:08 zeroing out "/dev/mmcblk1p6"
blackrock: kern:  notice: [2022-07-28T12:33:41.949740828Z]: XFS (mmcblk1p3): Mounting V5 Filesystem
blackrock: kern:    info: [2022-07-28T12:33:42.069950828Z]: XFS (mmcblk1p3): Ending clean mount
blackrock: kern:  notice: [2022-07-28T12:33:48.236939828Z]: XFS (mmcblk1p3): Unmounting Filesystem
blackrock: user: warning: [2022-07-28T12:33:48.291965828Z]: 2022/07/28 12:34:14 restored contents of "BOOT"
blackrock: user: warning: [2022-07-28T12:33:48.358252828Z]: 2022/07/28 12:34:14 restored contents of "META"
blackrock: user: warning: [2022-07-28T12:33:53.509970828Z]: 2022/07/28 12:34:19 restored contents of "STATE"
blackrock: kern:  notice: [2022-07-28T12:33:53.528782828Z]: XFS (mmcblk1p3): Mounting V5 Filesystem
blackrock: kern:    info: [2022-07-28T12:33:53.682811828Z]: XFS (mmcblk1p3): Ending clean mount
blackrock: user: warning: [2022-07-28T12:33:53.704839828Z]: 2022/07/28 12:34:19 copying /usr/install/arm64/vmlinuz to /boot/B/vmlinuz
blackrock: user: warning: [2022-07-28T12:33:53.868117828Z]: 2022/07/28 12:34:20 copying /usr/install/arm64/initramfs.xz to /boot/B/initramfs.xz
blackrock: user: warning: [2022-07-28T12:33:53.957148828Z]: 2022/07/28 12:34:20 writing /boot/grub/grub.cfg to disk
blackrock: user: warning: [2022-07-28T12:33:53.960706828Z]: 2022/07/28 12:34:20 executing: grub-install --boot-directory=/boot --efi-directory=/boot/EFI --removable --target=arm64-efi /dev/mmcblk1
blackrock: user: warning: [2022-07-28T12:33:53.962825828Z]: Installing for arm64-efi platform.
blackrock: user: warning: [2022-07-28T12:34:00.944116828Z]: Installation finished. No error reported.
blackrock: user: warning: [2022-07-28T12:34:00.996943828Z]: 2022/07/28 12:34:27 installing U-Boot for "rockpi_4"
blackrock: user: warning: [2022-07-28T12:34:01.010227828Z]: 2022/07/28 12:34:27 writing /usr/install/arm64/u-boot/rockpi_4/u-boot-rockchip.bin at offset 32768
blackrock: user: warning: [2022-07-28T12:34:01.027306828Z]: 2022/07/28 12:34:27 wrote 9368664 bytes
blackrock: kern:  notice: [2022-07-28T12:34:01.717255828Z]: XFS (mmcblk1p3): Unmounting Filesystem
blackrock: user: warning: [2022-07-28T12:34:01.787748828Z]: 2022/07/28 12:34:28 installation of v1.1.2 complete
blackrock: rpc error: code = Unavailable desc = error reading from server: EOF

Monitor photo only, as the RockPI is not booting.

Environment

  • Talos version: upgrading from 1.1.1 to 1.1.2
  • Platform: RockPI 4c, arm64

cvandesande avatar Jul 28 '22 12:07 cvandesande

I don't have an exact answer, but it looks like mmcblk1 disappears after the upgrade.

I'm not familiar with the board, but to make Talos happy you can update the machine configuration before an upgrade, setting machine.install.disk to /dev/mmcblk0. In fact, Talos only ever uses that value during the actual install, so it doesn't matter after the initial install.

smira avatar Jul 28 '22 13:07 smira

I'll update the machine config and follow up

cvandesande avatar Jul 28 '22 13:07 cvandesande

Interesting observation. I dd'd a fresh 1.1.2 install to the sdcard, and the sdcard shows up as /dev/mmcblk1. The apply-config command fails if I try to set the install disk to /dev/mmcblk0.

$ talosctl disks --insecure --nodes blackrock
DEV            MODEL             SERIAL       TYPE   UUID   WWID   MODALIAS      NAME    SIZE     BUS_PATH
/dev/mmcblk1   -                 0x4e01293a   SD     -      -      -             SN64G   64 GB    /platform/fe320000.mmc/mmc_host/mmc1/mmc1:aaaa/
/dev/sda       Samsung SSD 870   -            SSD    -      -      scsi:t-0x00   -       2.0 TB   /platform/f8000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/ata1/host0/target0:0:0/0:0:0:0/

cvandesande avatar Jul 28 '22 13:07 cvandesande

Interesting... does it change with a reboot? In fact, validating the install disk in that phase is a bug, as it's irrelevant: Talos finds its block device by partition labels.
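
To illustrate the point about partition labels, here is a minimal Python sketch. It is not Talos's actual implementation: the partition tables are hypothetical data modelled on the labels and partition numbers in the upgrade log above, and `find_partition` is an invented helper. It only shows why a lookup by label is unaffected when the disk is renamed from mmcblk1 to mmcblk0.

```python
from typing import Optional

# Hypothetical partition tables for the same card before and after the
# rename seen in this issue. Labels and partition numbers follow the
# upgrade log; nothing here is read from a real device.
BEFORE_REBOOT = [
    {"device": "/dev/mmcblk1p3", "label": "BOOT"},
    {"device": "/dev/mmcblk1p4", "label": "META"},
    {"device": "/dev/mmcblk1p5", "label": "STATE"},
]
AFTER_REBOOT = [
    {"device": "/dev/mmcblk0p3", "label": "BOOT"},
    {"device": "/dev/mmcblk0p4", "label": "META"},
    {"device": "/dev/mmcblk0p5", "label": "STATE"},
]


def find_partition(partitions: list, label: str) -> Optional[str]:
    """Return the device node of the first partition carrying `label`."""
    for part in partitions:
        if part["label"] == label:
            return part["device"]
    return None


if __name__ == "__main__":
    # The label lookup succeeds under either name, whereas a hard-coded
    # "/dev/mmcblk1" only matches the first table.
    print(find_partition(BEFORE_REBOOT, "STATE"))
    print(find_partition(AFTER_REBOOT, "STATE"))
```

Under these assumptions the lookup never depends on the configured install disk name, which is why validating that name on boot is unnecessary.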

smira avatar Jul 28 '22 13:07 smira

Yeah, just reviewing the logs, the sdcard seems to change from /dev/mmcblk1 to /dev/mmcblk0, but only after upgrades. If I kill the power to the node, it restarts properly. The disk only seems to change during upgrades.

cvandesande avatar Jul 28 '22 13:07 cvandesande

I have a Rock Pi 4C and have never seen any issues with upgrades, but in my case it's an NVMe drive, so it always shows up as /dev/nvme0. Is the /dev/mmcblk the on-board SPI flash?

frezbo avatar Jul 28 '22 13:07 frezbo

I've found an old 16G eMMC card lying around, and successfully installed Talos and booted off eMMC. The eMMC shows up as /dev/mmcblk0. I've updated the machine config and the RockPI has rejoined the cluster.

I'll wait for the next upgrade and see if it works better

$ talosctl disks --nodes blackrock dmesg
NODE        DEV                 MODEL             SERIAL       TYPE   UUID   WWID   MODALIAS      NAME     SIZE     BUS_PATH
blackrock   /dev/mmcblk0        -                 0xd2e703a1   SD     -      -      -             58A43A   16 GB    /platform/fe330000.mmc/mmc_host/mmc0/mmc0:0001/
blackrock   /dev/mmcblk0boot0   -                 -            SD     -      -      -             -        4.2 MB   /platform/fe330000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot0
blackrock   /dev/mmcblk0boot1   -                 -            SD     -      -      -             -        4.2 MB   /platform/fe330000.mmc/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot1
blackrock   /dev/sda            Samsung SSD 870   -            SSD    -      -      scsi:t-0x00   -        2.0 TB   /platform/f8000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/ata4/host3/target3:0:0/3:0:0:0/

cvandesande avatar Jul 28 '22 14:07 cvandesande

So I'm back with this same issue after upgrading from 1.1.2 to 1.2.5. It's the same error, except now /dev/mmcblk0 is missing, and /dev/mmcblk1 has appeared.

This might be a dumb question, but I noticed that in GRUB I have two choices, A and B. For this upgrade, 1.2.5 appears on B. Is it possible that when GRUB boots choice B, the MMC card changes device names?

cvandesande avatar Oct 13 '22 17:10 cvandesande

This must be the kernel, since there are kernel updates between 1.1.2 and 1.2.5. I think you should run talosctl disks and specify the install disk with a diskSelector, so it will always find the right one.

frezbo avatar Oct 13 '22 17:10 frezbo

Is the disk selector available to me for version upgrades? The disk whether it's mmcblk1 or mmcblk0 is always correct for the initial install (I've re-flashed the MMC many times due to this issue). This issue only occurs when I run talosctl upgrade.

cvandesande avatar Oct 13 '22 18:10 cvandesande

Is the disk selector available to me for version upgrades? The disk whether it's mmcblk1 or mmcblk0 is always correct for the initial install (I've re-flashed the MMC many times due to this issue). This issue only occurs when I run talosctl upgrade.

I believe you'd have to start with a fresh install.

frezbo avatar Oct 13 '22 18:10 frezbo

https://www.talos.dev/v1.2/reference/configuration/#installdiskselector you should use the busPath

frezbo avatar Oct 13 '22 18:10 frezbo

Oh busPath looks promising! I will test and follow up.

cvandesande avatar Oct 13 '22 18:10 cvandesande

Ok, so reflashed and reinstalled again, this time using diskSelector. Does the following seem correct?

    install:
        diskSelector:
            busPath: /platform/fe330000.mmc/mmc_host/*

$ talosctl disks --nodes blackrock
NODE        DEV                 MODEL             SERIAL       TYPE   UUID   WWID   MODALIAS      NAME     SIZE     BUS_PATH
blackrock   /dev/mmcblk1        -                 0xd2e703a1   SD     -      -      -             58A43A   16 GB    /platform/fe330000.mmc/mmc_host/mmc1/mmc1:0001/
blackrock   /dev/mmcblk1boot0   -                 -            SD     -      -      -             -        4.2 MB   /platform/fe330000.mmc/mmc_host/mmc1/mmc1:0001/block/mmcblk1/mmcblk1boot0
blackrock   /dev/mmcblk1boot1   -                 -            SD     -      -      -             -        4.2 MB   /platform/fe330000.mmc/mmc_host/mmc1/mmc1:0001/block/mmcblk1/mmcblk1boot1
blackrock   /dev/sda            Samsung SSD 870   -            SSD    -      -      scsi:t-0x00   -        2.0 TB   /platform/f8000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/ata4/host3/target3:0:0/3:0:0:0/

I actually flashed Talos 1.2.4 by mistake, but it provided a good opportunity to test an upgrade. The upgrade worked; however, the MMC card remained at /dev/mmcblk1, so I couldn't really test whether an upgrade would work if the MMC decides to appear as mmc0.
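
The mmc0 case that could not be tested on the hardware can at least be checked against the pattern offline. The sketch below is a rough illustration, assuming the busPath selector uses shell-style glob semantics comparable to Python's `fnmatch` (Talos itself is written in Go, so its matcher may differ in edge cases). The bus paths are copied from the `talosctl disks` output in this thread; `matching_disks` and the dictionary keys are hypothetical names.

```python
from fnmatch import fnmatchcase

# Selector from the machine config in this comment.
SELECTOR = "/platform/fe330000.mmc/mmc_host/*"

# Bus paths taken from the talosctl disks listings in this thread.
BUS_PATHS = {
    "eMMC as mmc0": "/platform/fe330000.mmc/mmc_host/mmc0/mmc0:0001/",
    "eMMC as mmc1": "/platform/fe330000.mmc/mmc_host/mmc1/mmc1:0001/",
    "SATA SSD": (
        "/platform/f8000000.pcie/pci0000:00/0000:00:00.0/"
        "0000:01:00.0/ata4/host3/target3:0:0/3:0:0:0/"
    ),
}


def matching_disks(selector: str, bus_paths: dict) -> list:
    """Return the names of all disks whose bus path matches the glob."""
    return [
        name
        for name, path in bus_paths.items()
        if fnmatchcase(path, selector)  # case-sensitive glob match
    ]


if __name__ == "__main__":
    for name in matching_disks(SELECTOR, BUS_PATHS):
        print(name)
```

Under that assumption the pattern matches the eMMC whether the controller enumerates as mmc0 or mmc1, and does not match the SSD. The mmcblk*boot0 and mmcblk*boot1 entries share the same prefix, so a glob this broad would match them as well.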

cvandesande avatar Oct 13 '22 18:10 cvandesande

Using the diskSelector config from https://github.com/siderolabs/talos/issues/5978#issuecomment-1278040032 resolved my failure-to-boot-after-upgrade issue.

Does this require a documentation update for RockPIs, or am I the only one that had the issue? I can submit a PR for docs if anyone thinks it's needed. Otherwise, thanks for the help!

cvandesande avatar Oct 28 '22 11:10 cvandesande

I'll share my findings with an RPI4, where I also had an issue with /dev/mmcblk1 becoming /dev/mmcblk0 after the installation of Talos (Slack thread).

After the installation, Talos from PXE boot was trying to boot from disk, but throwing an error message:

specified install disk does not exist: /dev/mmcblk1

And guess what, during the initial install of uboot, the SD card is called /dev/mmcblk1 and it installs properly. But during the following uboot boot, the SD card gets recognized as /dev/mmcblk0: (this is after the install)

talosctl -n 192.168.1.167 --talosconfig cluster-0-talosconfig disks
NODE            DEV            MODEL   SERIAL       TYPE   UUID   WWID   MODALIAS   NAME    SIZE    BUS_PATH                                                   SUBSYSTEM          SYSTEM_DISK
192.168.1.167   /dev/mmcblk0   -       0xb285ce23   SD     -      -      -          SN64G   64 GB   /platform/emmc2bus/fe340000.mmc/mmc_host/mmc0/mmc0:aaaa/   /sys/class/block   *

blackliner avatar Dec 07 '23 04:12 blackliner

Talos 1.6 drops this erroneous config validation on boot, fyi.

smira avatar Dec 08 '23 15:12 smira