amd64 images built under a Debian 12 Docker image end up with no BIOS bootloader due to `lsblk` malfunctioning

Open ArrayBolt3 opened this issue 6 months ago • 16 comments

In chroot-script, the function is_grub_bios_compatible determines whether a GRUB BIOS bootloader can be installed to the target drive. For GPT disks, it runs lsblk -pnlo NAME,PARTTYPE to list every partition and its partition type on the system, then searches the output for partitions on the target device whose partition type UUID is 21686148-6449-6e6f-744e-656564454649. This works perfectly fine on physical and virtual machines, but it fails badly under Docker, where lsblk -pnlo NAME,PARTTYPE prints the following (at least when using an image based on the official Debian 12 image):

lsblk -pnlo NAME,PARTTYPE
/dev/loop1     
/dev/loop2     
/dev/loop3     
/dev/loop4     
/dev/loop5     
/dev/loop6     
/dev/loop7     
/dev/loop8     
/dev/loop9     
/dev/loop10    
/dev/loop11    
/dev/loop12    
/dev/loop13    
/dev/loop14    
/dev/zram0     
/dev/nvme0n1   
/dev/nvme1n1   
/dev/nvme0n1p1 
/dev/nvme0n1p2 
/dev/nvme0n1p3 
/dev/nvme0n1p4 
/dev/nvme1n1p1

All of the partition type UUIDs are missing, so is_grub_bios_compatible assumes the disk does NOT support a BIOS bootloader, and thus installs an EFI-only bootloader.
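The check described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual is_grub_bios_compatible from chroot-script; the function names are made up for this example, and the input is assumed to be in the format printed by lsblk -pnlo NAME,PARTTYPE.

```shell
# Hypothetical sketch; NOT the actual is_grub_bios_compatible implementation.

# GPT partition type UUID that the check looks for (taken from the report).
BIOS_BOOT_TYPE='21686148-6449-6e6f-744e-656564454649'

# Succeeds if $2 names a partition of disk $1, e.g. /dev/sda1 on /dev/sda
# or /dev/nvme0n1p1 on /dev/nvme0n1.
is_partition_of() {
  local disk="$1" name="$2"
  case "${disk}" in
    *[0-9]) # disk name ends in a digit: partitions are named <disk>p<N>
      case "${name}" in "${disk}"p[0-9]*) return 0 ;; esac ;;
    *)      # otherwise partitions are named <disk><N>
      case "${name}" in "${disk}"[0-9]*) return 0 ;; esac ;;
  esac
  return 1
}

# Reads "NAME PARTTYPE" lines on stdin and succeeds if any partition of
# disk $1 carries the BIOS boot partition type.
has_bios_boot_partition() {
  local disk="$1" name parttype
  while read -r name parttype; do
    is_partition_of "${disk}" "${name}" || continue
    # Compare case-insensitively by lowercasing the hex digits.
    parttype="$(printf '%s' "${parttype}" | tr 'A-F' 'a-f')"
    [ "${parttype}" = "${BIOS_BOOT_TYPE}" ] && return 0
  done
  return 1
}
```

Used as `lsblk -pnlo NAME,PARTTYPE | has_bios_boot_partition /dev/nvme0n1`, a check like this fails whenever the PARTTYPE column is empty, which matches the behavior seen under Docker.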

Will add reproduction steps later, this is sort of a hurried brain dump for now.

ArrayBolt3 • Jun 29 '25 19:06

It seems to me this will need a similar solution as debian bug #1108311

zeha • Jun 29 '25 20:06

Steps to reproduce:

  • docker image pull debian:12
  • docker run --name debian-grml-test --interactive --tty --rm --privileged debian:12 /bin/bash
  • lsblk -pnlo NAME,PARTTYPE

Expected result: Partition types should be shown next to some of the partition devices displayed

Actual result: Only partition devices are shown, no partition type UUIDs are shown.

  • apt update
  • apt install dpkg-dev debhelper build-essential git
  • apt install --no-install-recommends asciidoc docbook-xsl shunit2 xsltproc
  • git clone https://github.com/grml/grml-debootstrap.git
  • cd grml-debootstrap
  • dpkg-buildpackage -i -us -uc -b
  • cd ..
  • apt install ./grml-debootstrap_0.121_all.deb
  • grml-debootstrap --release bookworm --arch amd64 --target ./vm.img --force --vmfile --vmefi --password x
  • On the host, NOT in the container, sudo docker cp debian-grml-test:/vm.img ./
  • On the host, sudo chown "$USER:$USER" ./vm.img
  • On the host, qemu-system-x86_64 -drive file=./vm.img,format=raw,if=virtio -m 2G -smp 2 -enable-kvm

Expected result: VM boots.

Actual result: VM hangs at "Booting from Hard Disk..."

  • On the host, qemu-system-x86_64 -drive file=./vm.img,format=raw,if=virtio -m 2G -smp 2 -enable-kvm -bios /usr/share/ovmf/OVMF.fd

Expected and actual result: VM boots.

This is using an Ubuntu 24.04 LTS host for Docker.

Worthy of note, blkid allows reading the partition type UUIDs, so I have no idea what lsblk is doing wrong that it can't figure out docker. It might be possible to use blkid rather than lsblk to work around this, but I've had race condition issues doing that, so I'm not sure if that will be better in the long run or not.
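For reference, a minimal sketch of the blkid-based lookup mentioned above. It assumes blkid's low-level probing mode (`-p`) with `-o export`, which prints KEY=value lines including PART_ENTRY_TYPE for partitions; the function names are hypothetical, and the sketch does nothing about the race conditions noted above.

```shell
# Hypothetical sketch of reading a partition type UUID via blkid.

# Extracts the PART_ENTRY_TYPE value (lowercased) from KEY=value lines on
# stdin; fails if no such key is present.
part_entry_type() {
  local key value
  while IFS='=' read -r key value; do
    if [ "${key}" = 'PART_ENTRY_TYPE' ]; then
      printf '%s\n' "${value}" | tr 'A-F' 'a-f'
      return 0
    fi
  done
  return 1
}

# Prints the partition type UUID of the partition device $1.
# Assumes blkid is installed and the caller may read the device.
partition_type_of() {
  blkid -p -o export "$1" | part_entry_type
}
```

For example, `partition_type_of /dev/nvme0n1p1` would print the type UUID of that partition if blkid can probe it.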

> It seems to me this will need a similar solution as debian bug #1108311

hmm... not sure if that's related or not.

ArrayBolt3 • Jun 29 '25 20:06

> hmm... not sure if that's related or not.

Yes, because what you are seeing here is caused by reusing /dev from the outside. What we really need is a grml-debootstrap-managed /dev.

zeha • Jun 29 '25 20:06

My reproduction steps are actually flawed: they fail to take into account that Debian Trixie must be used for parted to be able to change the partition type UUIDs. To truly reproduce this, you'd have to format a raw image yourself and change the partition type UUIDs using fdisk or similar. Still, you get the idea, and the lsblk example shows the issue.
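One possible way to prepare such a raw image by hand is an sfdisk script, sketched below. The function name, partition sizes, and image file name are assumptions made for illustration; the first type value is the UUID from the report, and the second is the generic Linux filesystem type.

```shell
# Hypothetical sketch: prints an sfdisk script describing a GPT disk with a
# 1 MiB partition of the type the check looks for, followed by a Linux
# filesystem partition that takes the remaining space.
gpt_bios_boot_layout() {
  cat <<'EOF'
label: gpt
size=1MiB, type=21686148-6449-6E6F-744E-656564454649
type=0FC63DAF-8483-4772-8E79-3D69D8477DE4
EOF
}

# Example use (file name and size are placeholders):
#   truncate -s 1G ./test.img
#   gpt_bios_boot_layout | sfdisk ./test.img
```

The resulting image still needs a filesystem and an installed system before it can be used to test the bootloader step.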

> Yes, because what you are seeing here is caused by reusing /dev from the outside. What we really need is a grml-debootstrap-managed /dev.

OK, I guess I don't really understand how things work in this context then. mount | grep /dev reveals that /dev in the container is just a tmpfs, so I don't know what Docker is doing there but it would make sense that it would cause issues. Do you have suggestions on how to implement what you're mentioning? I could give it a shot once I understand the idea.

ArrayBolt3 • Jun 29 '25 20:06

> Do you have suggestions on how to implement what you're mentioning? I could give it a shot once I understand the idea.

So what I expect we need to do for the other bug:

  • stop mounting the existing /dev, /dev/pts, /run/udev things into the chroot
  • mount a new tmpfs as $chroot/dev
  • mknod a few basic things so installing packages works (null, zero, console, ...)
  • after packages are installed, udevadm settle ... or whatever is the right command to have udev populate the $chroot/dev
  • only then run bootloader install etc
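The first three steps of this plan might look roughly like the sketch below. It is not grml-debootstrap code: the function names and the DRY_RUN switch are hypothetical, the device permissions are assumptions, and the device numbers are the standard Linux ones for null, zero, and console. The udev step is left out because the right command was still an open question at this point.

```shell
# Hypothetical sketch of a grml-debootstrap-managed /dev (first steps only).

# Runs the given command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = '1' ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Mounts a fresh tmpfs as <chroot>/dev and creates a few basic device
# nodes so that package installation can work. Needs root unless DRY_RUN=1.
setup_private_dev() {
  local chroot_dir="$1"
  run mount -t tmpfs -o mode=0755 tmpfs "${chroot_dir}/dev"
  run mknod -m 666 "${chroot_dir}/dev/null" c 1 3
  run mknod -m 666 "${chroot_dir}/dev/zero" c 1 5
  run mknod -m 600 "${chroot_dir}/dev/console" c 5 1
  # Populating the remaining nodes through udev would follow here, after
  # packages are installed and before the bootloader is installed.
}
```

Calling `DRY_RUN=1 setup_private_dev /mnt/target` (path is a placeholder) prints the commands without touching the system.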

zeha • Jun 29 '25 20:06

That makes sense. I'll give that a shot.

ArrayBolt3 • Jun 29 '25 20:06

Running udevadm settle in a chroot resulted in the message "Running in chroot, ignoring request." I don't think going through udev is going to work; it doesn't seem to be designed to create a second device tree.

What may work though is "copying" the /dev directory from the host to the chroot, by determining the major and minor device numbers for each device under /dev and mknod'ing new devices with those same numbers in the chroot. That would only take a snapshot of the host's device state, but it would be better than nothing. If we need it to stay up-to-date during the build, we can probably use something like inotify or fanotify.

ArrayBolt3 • Jul 01 '25 02:07

Just to make sure it was actually possible, I wrote a proof-of-concept script that can make a "copy" of /dev. It seems to work well enough. The final implementation will most likely have to work around files that are already present in /dev, but that shouldn't be a problem. The script:

#!/bin/bash
# dev-clone.sh - recursively copies a device directory tree such as /dev

main() {
  local src_dir tgt_dir src_entry_list src_dir_list src_dir_item \
    tgt_dir_item src_entry tgt_entry file_type_char dev_id dev_permissions \
    link_target

  src_dir="${1:-}"
  tgt_dir="${2:-}"
  if [ -z "${src_dir}" ] || [ -z "${tgt_dir}" ]; then
    printf '%s\n' 'ERROR: Source and destination must both be specified.' >&2
    exit 1
  fi

  readarray -t src_dir_list < <(find "${src_dir}" -type d)
  readarray -t src_entry_list < <(find "${src_dir}")

  for src_dir_item in "${src_dir_list[@]}"; do
    tgt_dir_item="${src_dir_item/"${src_dir}"/"${tgt_dir}"}"
    mkdir -p "${tgt_dir_item}" || {
      printf '%s\n' "ERROR: Failed to create directory ${tgt_dir_item}!" >&2
      exit 1
    }
  done

  for src_entry in "${src_entry_list[@]}"; do
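    # Skip directories (already created above). Note: -d follows symlinks,
    # so a symlink to a directory (e.g. /dev/fd) is skipped here as well.
    [ -d "${src_entry}" ] && continue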
    tgt_entry="${src_entry/"${src_dir}"/"${tgt_dir}"}"
    file_type_char="$(head -c1 < <(stat --format='%A' "${src_entry}"))"
    
    case "${file_type_char}" in
      -) # normal file
        cp -a "${src_entry}" "${tgt_entry}"
        ;;
      l) # symbolic link
        link_target="$(readlink "${src_entry}")"
        ln -s "${link_target}" "${tgt_entry}"
        ;;
      c|b) # character or block device
        dev_id="$(stat --format='%Hr %Lr' "${src_entry}")"
        dev_permissions="$(stat --format='%a' "${src_entry}")"
        mknod -m "${dev_permissions}" "${tgt_entry}" "${file_type_char}" \
          ${dev_id}
        ;;
    esac
  done
}

main "$@"

I'll probably be integrating this into grml-debootstrap, but in case someone else sees this and needs it, consider it licensed as GPLv2 or later since that's the license of grml-debootstrap.

@zeha Do you see any problems using this approach?

ArrayBolt3 • Jul 01 '25 03:07

Yes, I assume this will not work for VM images, as the by-uuid symlinks etc will not be there.

zeha • Jul 01 '25 10:07

Good point. I guess an inotify-based mechanism or similar will be needed then... either that or else grml-debootstrap will need to explicitly re-sync the cloned dev filesystem at certain critical points, which might be less race-prone.

ArrayBolt3 • Jul 01 '25 14:07

Cloning /dev will not fix the Docker issue. I just tested it:

  • Enter a Debian 12 docker environment: docker run --name debian-grml-test --interactive --tty --rm --privileged debian:12 /bin/bash
  • Install mmdebstrap and vim
  • Copy the dev-clone.sh script above and paste it into a script /dev-clone.sh inside the container, using Vim to edit
  • chmod +x /dev-clone.sh
  • Spin up a Debian 12 chroot within the container: mmdebstrap bookworm test-chroot
  • Delete test-chroot/dev
  • Re-create it using the clone script: /dev-clone.sh /dev test-chroot/dev
  • Recreate /dev/fd within the chroot because somehow the script missed it: ln -s /proc/self/fd test-chroot/dev/fd
  • Bind mount proc and sys: mount --bind /proc test-chroot/proc; mount --bind /sys test-chroot/sys
  • Enter the chroot: chroot test-chroot
  • Run lsblk -pnlo NAME,PARTTYPE
  • No partition type UUIDs are displayed.

I think Docker itself is at fault here. Other people seem to have had problems using Docker and lsblk together as well: https://forums.docker.com/t/why-does-lsblk-report-partition-as-null-from-within-a-docker-container/139393

ArrayBolt3 • Jul 02 '25 00:07

Reported as a bug in Docker: https://github.com/moby/moby/issues/50304

ArrayBolt3 • Jul 02 '25 01:07

I'm somewhat sure lsblk needs the udev database to read the UUIDs from. Should probably check the lsblk source.
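A quick way to see whether a udev database is available in a given environment is sketched below. It assumes the database lives under /run/udev/data, where systemd-udevd keeps one entry per device; the function name is hypothetical, and the sketch says nothing about how lsblk itself consults the database.

```shell
# Hypothetical sketch: succeeds if a udev database with at least one entry
# exists below the runtime directory $1 (default: /run/udev).
udev_db_present() {
  local run_dir="${1:-/run/udev}" entry
  for entry in "${run_dir}/data"/*; do
    # If the glob matched nothing, the literal pattern fails the -e test.
    [ -e "${entry}" ] && return 0
  done
  return 1
}
```

Running `udev_db_present || echo 'no udev database'` inside and outside the container would show whether the two environments differ in this respect.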

zeha • Jul 02 '25 11:07

the following works for me (without docker, haven't tried). copying the block devices is necessary probably because i'm doing something else wrong...

#!/bin/bash
set -ex
R=$PWD/chroot

rm -rf chroot
mmdebstrap --mode=unshare --include=udev,systemd-sysv,grub-cloud-arm64,linux-image-arm64 trixie $R

for x in /dev/block/*; do realdev=$(readlink -f "$x") ; echo "creating $realdev"; cp -a "$realdev" "$R/dev/" ; done

cat >$R/init <<EOT
#!/bin/bash
set -ex
mkdir /dev/pts
mount -t devpts devpts /dev/pts
mount -t tmpfs run /run
mount -t sysfs sysfs /sys
mount


ls -la /dev

mknod /dev/console c 5 1
mknod /dev/null c 1 3
mknod /dev/zero c 1 5
mknod /dev/full c 1 7
ln -s /proc/self/fd/ /dev/fd

echo starting udevd
/usr/lib/systemd/systemd-udevd &
sleep 2

echo trigger
udevadm trigger --type=all --action=add --prioritized-subsystem=module,block,tty -w

echo done
ls -la /dev
ls -la /dev/block

mkdir /boot/grub
mkdir /boot/efi
#grub-install
update-grub

grep root= /boot/grub/grub.cfg

set +ex
exec /bin/bash -i

EOT

chmod a+rx $R/init

unshare --pid --ipc --mount --mount-proc -R $R --fork --kill-child /init

umount --recursive $R/sys $R/proc $R/dev $R/run

...

+ grep root= /boot/grub/grub.cfg
	linux	/boot/vmlinuz-6.12.33+deb13-arm64 root=UUID=fdba34a7-77cf-4724-8e90-d9069fa6ce9e ro console=tty0 console=ttyAMA0
		linux	/boot/vmlinuz-6.12.33+deb13-arm64 root=UUID=fdba34a7-77cf-4724-8e90-d9069fa6ce9e ro console=tty0 console=ttyAMA0
		linux	/boot/vmlinuz-6.12.33+deb13-arm64 root=UUID=fdba34a7-77cf-4724-8e90-d9069fa6ce9e ro single console=tty0 console=ttyAMA0

my current copy of this script is https://salsa.debian.org/zeha/chr/-/blob/main/chr.sh?ref_type=heads

zeha • Jul 02 '25 12:07

> copying the block devices is necessary probably because i'm doing something else wrong...

after checking how this works in a normal install, i think this is the right approach.

zeha • Jul 02 '25 15:07

It looks like you're right about udev being needed for this: https://github.com/moby/moby/issues/50304#issuecomment-3029716931. Furthermore, it appears that udev can be used within Docker, and when it is, lsblk behaves itself. If so, this may not be a bug in grml-debootstrap at all, but instead a bug in my Docker environment. I'll do some more testing; if that turns out to be the case, we can close this and worry about /dev cloning for the other bug.

ArrayBolt3 • Jul 03 '25 04:07