centos2ol

Ensure the correct EFI boot entries are created after switching from CentOS to Oracle Linux

Open metal4lyf opened this issue 4 years ago • 31 comments

We are trying to migrate CentOS 8 systems to OL8.

The conversion script reports success, but it renders our systems unbootable: After the BIOS splash, we get several >> Checking media presence ..... messages on the terminal and then the system enters Dell BIOS recovery mode, which performs a memory test and then reports "No bootable devices found! ..."

Boot params are UEFI/Legacy Boot: OFF/Secure Boot: OFF.

I've isolated this issue to OL8 grub. Using a recovery stick, if I re-enable the CentOS BaseOS repo and install the latest version of grub2*, the system will boot to login with expected entries ("Oracle Linux" etc.) in the grub menu.

We're hesitant to proceed with migrations using this workaround because it requires us to continue using a potentially unsupported version of a fundamental component, not to mention we'll have to exclude grub in our dnf config to avoid bricking on dnf upgrades.

We use a stock grub configuration as far as I know.

/etc/default/grub:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto resume=/dev/mapper/VolGroup00-swap rd.lvm.lv=VolGroup00/root rd.lvm.lv=VolGroup00/swap rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

/boot/grub2/grubenv:

# GRUB Environment Block
kernelopts=root=/dev/mapper/VolGroup00-root ro crashkernel=auto resume=/dev/mapper/VolGroup00-swap rd.lvm.lv=VolGroup00/root rd.lvm.lv=VolGroup00/swap rhgb quiet
boot_success=1
boot_indeterminate=0
######################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################

Kernel: 4.18.0-240.15.1.el8_3.x86_64
Bricking grub2*: 1:2.02-90.0.2.el8_3.1 from ol8_baseos_latest
Working grub2*: <= 1:2.02-90.el8_3.1 from BaseOS

I've yet to see a useful message from grub despite removing rhgb quiet. Please let me know what other info would help here.

metal4lyf avatar Mar 08 '21 22:03 metal4lyf

Does 1:2.02-90.0.1.el8 work? If so, that at least will narrow our focus to the fixes in the .0.2 release.

Djelibeybi avatar Mar 08 '21 23:03 Djelibeybi

Also, can you tell us what type of device and controller you're using to boot?

Djelibeybi avatar Mar 08 '21 23:03 Djelibeybi

Could you also try running grub2-install <boot device> prior to rebooting to see if that resolves the issue?

Djelibeybi avatar Mar 08 '21 23:03 Djelibeybi

Here's the boot device info. I'll try the grub install now.

$ sudo lshw -class disk
  *-disk
       description: ATA Disk
       product: ST2000DM001-1ER1
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: CC27
       serial: Z4Z703D6
       size: 1863GiB (2TB)
       capacity: 1863GiB (2TB)
       capabilities: 7200rpm gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=6 guid=9d1775b4-835a-4c76-9e43-5d544b7ec8fc logicalsectorsize=512 sectorsize=4096

metal4lyf avatar Mar 08 '21 23:03 metal4lyf

Boot params are UEFI/Legacy Boot: OFF/Secure Boot: OFF.

To be on the same page, is the system in UEFI mode or legacy mode? You can check for the presence of /sys/firmware/efi on the booted system.
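A minimal sketch of that check, for anyone following along. The function name boot_mode and its optional directory argument are made up for illustration; the only thing it relies on is that /sys/firmware/efi exists when the kernel was booted through UEFI.

```shell
#!/bin/bash
# Report whether the running system was booted via UEFI or legacy BIOS.
# The optional argument exists only so the check can be pointed at another
# directory (for example in a test); it defaults to the real sysfs path.
boot_mode() {
    local efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "UEFI"
    else
        echo "legacy"
    fi
}

boot_mode
```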

aburmash avatar Mar 09 '21 00:03 aburmash

UEFI

metal4lyf avatar Mar 09 '21 00:03 metal4lyf

I can't get grub2-install working. It complains about missing modinfo.sh. No directory under /boot contains this file so I'm not sure what to pass it. Trying 90.0.1 now.

metal4lyf avatar Mar 09 '21 00:03 metal4lyf

Yeah, forget about grub2-install. It is for legacy. Please run the following just before the reboot:

efibootmgr -v
find /boot | grep redhat
find /boot | grep centos
rpm -qa | grep shim

aburmash avatar Mar 09 '21 00:03 aburmash

90.0.1 doesn't boot either. Reinstalled 90.0.2. Here are the results:

efibootmgr -v:

BootCurrent: 0011
Timeout: 1 seconds
BootOrder: 0001,000C,000D,000E,000F,0010,0006,0011,0008,0009,000A,000B
Boot0000* Windows Boot Manager	HD(1,GPT,87d93515-2374-4b87-9701-5a4c527ee83b,0x800,0x145000)/File(\EFI\Microsoft\Boot\bootmgfw.efi)WINDOWS.........x...B.C.D.O.B.J.E.C.T.=.{.9.d.e.a.8.6.2.c.-.5.c.d.d.-.4.e.7.0.-.a.c.c.1.-.f.3.2.b.3.4.4.d.4.7.9.5.}...;................
Boot0001* CentOS Linux	HD(1,GPT,b7460ef9-456e-4086-95f9-7dc69e80ddaa,0x800,0x12c000)/File(\EFI\centos\shimx64.efi)
Boot0006* HDD	NVMe(0x1,01-00-00-00-00-00-00-00)/HD(1,GPT,54969a86-cdfd-4d17-a677-4063a30945af,0x800,0x12c000)
Boot0008* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0009* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000A* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot000B* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000C* Diskette Drive	BBS(Floppy,Diskette Drive,0x0)..BO
Boot000D* Internal HDD	BBS(HD,Internal HDD,0x0)..BO
Boot000E* USB Storage Device	BBS(USB,SanDisk,0x0)..BO
Boot000F* P7: HL-DT-ST DVD-ROM DH50N	BBS(CDROM,P7: HL-DT-ST DVD-ROM DH50N,0x0)..BO
Boot0010  Onboard NIC	BBS(Network,IBA CL Slot 00FE v0110,0x0)..BO
Boot0011* UEFI: SanDisk	PciRoot(0x0)/Pci(0x14,0x0)/USB(7,0)/USB(1,0)/HD(1,GPT,87182ce7-da3d-414d-9ff3-3182544d7675,0x800,0x1dcf7df)..BO

find /boot | grep redhat:

/boot/efi/EFI/redhat
/boot/efi/EFI/redhat/fonts
/boot/efi/EFI/redhat/grubenv
/boot/efi/EFI/redhat/grubx64.efi

find /boot | grep centos:

/boot/efi/EFI/centos
/boot/efi/EFI/centos/shimx64-centos.efi
/boot/efi/EFI/centos/BOOTX64.CSV
/boot/efi/EFI/centos/mmx64.efi
/boot/efi/EFI/centos/grubenv
/boot/efi/EFI/centos/grub.cfg
/boot/efi/EFI/centos/shimx64.efi

rpm -qa | grep shim:

shim-x64-15-15.el8_2.x86_64

metal4lyf avatar Mar 09 '21 00:03 metal4lyf

How are you running centos2ol.sh, i.e. what parameters are you using?

Djelibeybi avatar Mar 09 '21 00:03 Djelibeybi

With or without -k, doesn't seem to matter. When we don't pass -k, uek is installed but not enabled. Shim does upgrade when we downgrade to BaseOS grub. I've also verified with BaseOS grub that we can boot uek.

metal4lyf avatar Mar 09 '21 00:03 metal4lyf

The shim-x64 package should be downgraded as part of the distro-sync that is run by default, i.e. after the switch you should have shim-x64-15-11 installed.

Djelibeybi avatar Mar 09 '21 00:03 Djelibeybi

And if you don't pass -k, the UEK should be installed and enabled, again with the downgrade of shim. Something else is happening here. Can you run the switch and pipe the output to a log file so we can see the entire process? If possible, run the script with no parameters, i.e. bash centos2ol.sh | tee -a centos2ol.log

Djelibeybi avatar Mar 09 '21 00:03 Djelibeybi

@metal4lyf So you have the CentOS shim and Oracle grub; that explains the problem. I'm pretty sure that if you do the following and then reboot, everything will automagically start working:

  1. rpm -e shim-x64 (remove the CentOS shim)
  2. yum install shim-x64 (from the Oracle repos)

If not, you will still need to replace the CentOS shim with the Oracle shim and then run efibootmgr -c -d /dev/sda -p 1 -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi"

where /dev/sda is the ESP disk and 1 is the partition number. You can run mount | grep boot and see which disk is mounted at /boot/efi to determine that. (Please note that I am writing about the ESP disk, not the boot disk.)

EDIT: you may need to do rpm -e shim-x64 --force, but careful(!): make absolutely sure you install a new shim after removing the old one.

EDIT2: we still need to figure out why the shim was not replaced in your case.

aburmash avatar Mar 09 '21 00:03 aburmash

Thanks, I'll wipe this system and stage it for another run tomorrow AM. I will update with the logs and then try your suggestions. (The reason for CentOS shim and Oracle grub is because I downgraded grub to CentOS in recovery mode after the boot failed, which switched to CentOS shim, and thereafter upgraded grub to Oracle, which did not modify shim.)

metal4lyf avatar Mar 09 '21 00:03 metal4lyf

Thanks @metal4lyf -- we very much appreciate the effort here!

Djelibeybi avatar Mar 09 '21 00:03 Djelibeybi

@metal4lyf if the system does not boot with Oracle shim + Oracle grub2, efibootmgr will save you. We should apply a fix for this on our side anyway, so running efibootmgr should be an immediate fix for you until it is addressed by the migration script.

aburmash avatar Mar 09 '21 10:03 aburmash

Here's the state after a fresh migration with no flags to the script. I may have lost the log but I'll find it or run again and add here.

#!/bin/bash -xv

grubby --info=ALL | grep ^kernel
+ grubby --info=ALL
+ grep '^kernel'
kernel="/boot/vmlinuz-5.4.17-2036.104.4.el8uek.x86_64"
kernel="/boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64"
kernel="/boot/vmlinuz-4.18.0-147.el8.x86_64"
kernel="/boot/vmlinuz-0-rescue-1e1b6984890346aab6d2b455f4f5af16"

grubby --default-kernel
+ grubby --default-kernel
/boot/vmlinuz-5.4.17-2036.104.4.el8uek.x86_64

efibootmgr -v
+ efibootmgr -v
BootCurrent: 0001
Timeout: 1 seconds
BootOrder: 0001,000C,000D,000E,000F,0010,0006,0011,0008,0009,000A,000B
Boot0000* Windows Boot Manager	HD(1,GPT,87d93515-2374-4b87-9701-5a4c527ee83b,0x800,0x145000)/File(\EFI\Microsoft\Boot\bootmgfw.efi)WINDOWS.........x...B.C.D.O.B.J.E.C.T.=.{.9.d.e.a.8.6.2.c.-.5.c.d.d.-.4.e.7.0.-.a.c.c.1.-.f.3.2.b.3.4.4.d.4.7.9.5.}...;................
Boot0001* CentOS Linux	HD(1,GPT,1e67f230-95c6-44d2-a9be-0f5cccc00561,0x800,0x12c000)/File(\EFI\centos\shimx64.efi)
Boot0006* HDD	NVMe(0x1,01-00-00-00-00-00-00-00)/HD(1,GPT,54969a86-cdfd-4d17-a677-4063a30945af,0x800,0x12c000)
Boot0008* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0009* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000A* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot000B* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter	PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000C* Diskette Drive	BBS(Floppy,Diskette Drive,0x0)..BO
Boot000D* Internal HDD	BBS(HD,Internal HDD,0x0)..BO
Boot000E* USB Storage Device	BBS(USB,SanDisk,0x0)..BO
Boot000F* P7: HL-DT-ST DVD-ROM DH50N	BBS(CDROM,P7: HL-DT-ST DVD-ROM DH50N,0x0)..BO
Boot0010  Onboard NIC	BBS(Network,IBA CL Slot 00FE v0110,0x0)..BO
Boot0011* UEFI: SanDisk	PciRoot(0x0)/Pci(0x14,0x0)/USB(7,0)/USB(1,0)/HD(1,GPT,87182ce7-da3d-414d-9ff3-3182544d7675,0x800,0x1dcf7df)..BO

find /boot | grep redhat
+ find /boot
+ grep redhat
/boot/efi/EFI/redhat
/boot/efi/EFI/redhat/fonts
/boot/efi/EFI/redhat/grubenv
/boot/efi/EFI/redhat/grubx64.efi
/boot/efi/EFI/redhat/BOOTX64.CSV
/boot/efi/EFI/redhat/mmx64.efi
/boot/efi/EFI/redhat/shimx64.efi
/boot/efi/EFI/redhat/grub.cfg

find /boot | grep centos
+ find /boot
+ grep centos
/boot/efi/EFI/centos
/boot/efi/EFI/centos/grubenv
/boot/efi/EFI/centos/grub.cfg

rpm -qa | grep shim
+ rpm -qa
+ grep shim
shim-x64-15-11.0.5.x86_64

metal4lyf avatar Mar 09 '21 15:03 metal4lyf

OK, so here is what is actually happening in your case: since you have migrated from CentOS to Oracle Linux, the CentOS EFI binaries are wiped, and the CentOS UEFI boot entry will be wiped on the next reboot. In that case, normally (on most systems) the /boot/efi/EFI/BOOT/BOOTX64.EFI binary is executed (that is the "default" boot path), and it runs fallback, which creates the UEFI boot entries for Oracle Linux. It looks like that is not happening in your case.
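As a rough aid for checking whether the pieces of that default-path fallback are in place, something like the sketch below could be run before rebooting. The function name check_fallback_files is hypothetical, and the fbx64.efi file name is an assumption based on how shim is usually packaged on x86_64; it does not appear elsewhere in this thread. Note that all files being present does not guarantee the firmware will actually take the default boot path.

```shell
#!/bin/bash
# List the files the default-path fallback relies on and report any that are
# missing. esp_root and vendor are parameters so the function can be pointed
# at a different ESP mount point or vendor directory.
# ASSUMPTION: fbx64.efi is the fallback binary name shipped next to BOOTX64.EFI.
check_fallback_files() {
    local esp_root="${1:-/boot/efi}"
    local vendor="${2:-redhat}"
    local missing=0
    local f
    for f in \
        "EFI/BOOT/BOOTX64.EFI" \
        "EFI/BOOT/fbx64.efi" \
        "EFI/$vendor/BOOTX64.CSV" \
        "EFI/$vendor/shimx64.efi"
    do
        if [ -e "$esp_root/$f" ]; then
            echo "present: $f"
        else
            echo "MISSING: $f"
            missing=1
        fi
    done
    # Non-zero exit status when at least one file is missing.
    return "$missing"
}
```

Calling check_fallback_files /boot/efi redhat prints one line per file and returns non-zero if anything is missing, so it can gate a reboot in a wrapper script.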

As an immediate measure, you can create a boot entry for Oracle Linux before the reboot by running

efibootmgr -c -d /dev/sda -p 1 -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi"

where /dev/sda is the ESP disk and 1 is the partition number. You can run mount | grep boot and see which disk is mounted at /boot/efi to determine that.

Ping me if you are unsure what to do with efibootmgr, and I will provide more detailed instructions.

Anyway, this case (fallback not happening) should be covered by our migration scripts, and that efibootmgr call should happen automatically.
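For the disk and partition arguments, a small helper along these lines may save some guesswork, particularly on NVMe devices where the partition suffix is p1 rather than 1. This is only a sketch: split_partition is a made-up name, and it works purely on the device name string without checking that the device exists.

```shell
#!/bin/bash
# Split a partition device path into "<disk> <partition number>".
#   /dev/sda1      -> /dev/sda 1
#   /dev/nvme0n1p1 -> /dev/nvme0n1 1
split_partition() {
    local dev="$1"
    local part="${dev##*[!0-9]}"     # trailing digits = partition number
    if [ -z "$part" ]; then
        echo "no partition number in: $dev" >&2
        return 1
    fi
    local disk="${dev%"$part"}"      # strip the partition number
    # Disks whose own name ends in a digit (NVMe, MMC) use a 'p' separator.
    case "$disk" in
        *[0-9]p) disk="${disk%p}" ;;
    esac
    echo "$disk $part"
}

# Example usage (commented out because it changes NVRAM boot entries):
# esp_dev=$(findmnt -n -o SOURCE /boot/efi)
# read -r disk part < <(split_partition "$esp_dev")
# efibootmgr -c -d "$disk" -p "$part" -L "Oracle Linux" -l '\EFI\redhat\shimx64.efi'
```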

aburmash avatar Mar 09 '21 15:03 aburmash

Ran the migration again. Log here: ol8.log

Before reboot I ran efibootmgr as follows:

lsblk:

NAME                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                   8:0    1 14.9G  0 disk 
└─sda1                8:1    1 14.9G  0 part /mnt/sd
sr0                  11:0    1 1024M  0 rom  
nvme0n1             259:0    0  1.9T  0 disk 
├─nvme0n1p1         259:1    0  600M  0 part /boot/efi
├─nvme0n1p2         259:2    0    1G  0 part /boot
└─nvme0n1p3         259:3    0  1.9T  0 part 
  ├─VolGroup00-root 253:0    0   50G  0 lvm  /
  ├─VolGroup00-swap 253:1    0  128G  0 lvm  [SWAP]
  └─VolGroup00-home 253:2    0  1.7T  0 lvm  /home

efibootmgr -c -d /dev/nvme0n1 -p 1 -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi":

BootCurrent: 0001
Timeout: 1 seconds
BootOrder: 0003,0001,0002,000C,000D,000E,000F,0010,0006,0011,0008,0009,000A,000B
Boot0000* Windows Boot Manager
Boot0001* CentOS Linux
Boot0002* Oracle Linux
Boot0006* HDD
Boot0008* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot0009* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot000A* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot000B* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot000C* Diskette Drive
Boot000D* Internal HDD
Boot000E* USB Storage Device
Boot000F* P7: HL-DT-ST DVD-ROM DH50N
Boot0010  Onboard NIC
Boot0011* UEFI: SanDisk
Boot0003* Oracle Linux

I got a warning about Oracle Linux already being present as Boot0002, but this does appear to have fixed the boot!

metal4lyf avatar Mar 09 '21 16:03 metal4lyf

Oracle Linux does now show up twice in our UEFI boot menu. Is there a variant of the efibootmgr command that would consolidate/overwrite instead?

EDIT: I may have clobbered the boot menu on the USB drive I've been using to reinstall this system. I wonder if the presence of this disk is related to the boot manager issues too?

metal4lyf avatar Mar 09 '21 16:03 metal4lyf

Well, if the entry was already present, you do not need to recreate it. Do both entries persist after reboot? If yes, we (and you) will need a simple check to execute efibootmgr only when the Oracle Linux entry is NOT present, something like:

if ! efibootmgr -v | grep -q "Oracle Linux"; then
  efibootmgr -c -d ...   # same create command as above
fi

A USB disk can't affect the number of entries, since they are stored in NVRAM, not on any plugged-in media. However, for the same reason (NVRAM storage), UEFI boot entries will not be wiped when you reinstall the system, as long as the binaries those boot entries point to are actually present.
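A slightly stricter version of that presence check could compare the entry label exactly instead of grepping the whole line, so that a device path or another label containing the same words does not count as a match. This is a sketch only; has_boot_entry is a hypothetical helper that reads efibootmgr output on stdin, and it assumes the label and device path in efibootmgr -v output are separated by a tab, as in the listings earlier in this thread.

```shell
#!/bin/bash
# Succeed if efibootmgr output (read from stdin) contains a boot entry whose
# label is exactly "$1". Works with both plain and -v output.
has_boot_entry() {
    local label="$1"
    awk -v label="$label" '
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            name = $0
            sub(/^Boot[0-9A-Fa-f]+\*?[ \t]+/, "", name)  # drop "Boot0002* "
            sub(/\t.*$/, "", name)                        # drop device path (-v)
            if (name == label) found = 1
        }
        END { exit found ? 0 : 1 }'
}

# Example usage (commented out because it changes NVRAM boot entries):
# if ! efibootmgr | has_boot_entry "Oracle Linux"; then
#     efibootmgr -c -d /dev/sda -p 1 -L "Oracle Linux" -l '\EFI\redhat\shimx64.efi'
# fi
```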

aburmash avatar Mar 09 '21 17:03 aburmash

Ran the migration again. Log here: ol8.log

According to this log, the switch installed our shim-x64 package as an upgrade. I also noticed that the script had to upgrade a bunch of packages to get yum-utils to install. Did you perhaps do a dnf update on the CentOS instance before switching last time? Because this run looks pretty flawless from a log perspective (and would explain the duplicate UEFI boot entries).

Djelibeybi avatar Mar 09 '21 17:03 Djelibeybi

I did not run dnf update last time. If the server has network access, our installer adds an internal application package post-install and performs a distro sync, so perhaps that explains the difference? Sometimes I unplug the network beforehand to save time. This all happens before running centos2ol.sh (with network).

I've run this many times now, with and without network on the initial install, and the result has always been the same. The logs from centos2ol.sh always look clean despite leaving the system unbootable.

metal4lyf avatar Mar 09 '21 17:03 metal4lyf

@aburmash knows way more about UEFI than I do, so I'm hoping to see a pull request soon that adds a bit of efibootmgr magic to centos2ol.sh to mitigate this issue.

Djelibeybi avatar Mar 09 '21 18:03 Djelibeybi

Well, if entry was already present you do not need to recreate it. Do both entries persist after reboot ?

Looks like that was a fluke, or at any rate there is only one entry after reboot, so we're good there.

Thanks for all the help!

metal4lyf avatar Mar 10 '21 00:03 metal4lyf

Just to be clear: are you now able to switch your Dell boxes to OL8 and still boot? I'm not sure if there's still an outstanding issue or not, and I wanted to check before I close this.

Djelibeybi avatar Mar 10 '21 01:03 Djelibeybi

Yes, it's working now with the efibootmgr fix. Here's what ultimately works after running centos2ol.sh:

# remove CentOS Linux (it is now unbootable); loop so that zero or several matches are handled
for num in $(efibootmgr | grep 'CentOS Linux' | sed -r 's/^Boot([0-9A-F]+).*/\1/'); do
  efibootmgr -b "$num" -B
done
# remove any Oracle Linux (if it was necessary to convert more than once, existing entries will be unbootable)
for num in $(efibootmgr | grep 'Oracle Linux' | sed -r 's/^Boot([0-9A-F]+).*/\1/'); do
  efibootmgr -b "$num" -B
done
# add new entry for Oracle Linux
disk=/dev/$(lsblk -o MOUNTPOINT,PKNAME,KNAME | grep /boot/efi | awk '{print $2}')
part=$(lsblk -o MOUNTPOINT,PKNAME,KNAME | grep /boot/efi | awk '{print $3}' | grep -o '[0-9]*$')
efibootmgr -c -d "$disk" -p "$part" -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi"

metal4lyf avatar Mar 10 '21 01:03 metal4lyf

Thanks. I've updated the issue title so that we can use it as a reference for any submitted pull requests.

Djelibeybi avatar Mar 10 '21 02:03 Djelibeybi

Referencing another issue with a similar cause: https://github.com/oracle/centos2ol/issues/73

aburmash avatar Mar 30 '21 15:03 aburmash