VM loop device not cleaned up in CI
* Removing loopback mount of file /code/qemu-1.img.
previous state:
loop3p1 (253:0)
/dev/loop1: [2065]:72205 (/var/lib/snapd/snaps/core20_2015.snap)
/dev/loop2: [2065]:72206 (/var/lib/snapd/snaps/snapd_20290.snap)
/dev/loop0: [2065]:72204 (/var/lib/snapd/snaps/lxd_24322.snap)
/dev/loop3: [2065]:282883 (/code/qemu-1.img)
after kpartx-d
loop3p1 (253:0)
/dev/loop1: [2065]:72205 (/var/lib/snapd/snaps/core20_2015.snap)
/dev/loop2: [2065]:72206 (/var/lib/snapd/snaps/snapd_20290.snap)
/dev/loop0: [2065]:72204 (/var/lib/snapd/snaps/lxd_24322.snap)
/dev/loop3: [2065]:282883 (/code/qemu-1.img)
loop_part is: loop3p1
loop3p1 (253:0)
/dev/loop1: [2065]:72205 (/var/lib/snapd/snaps/core20_2015.snap)
/dev/loop2: [2065]:72206 (/var/lib/snapd/snaps/snapd_20290.snap)
/dev/loop0: [2065]:72204 (/var/lib/snapd/snaps/lxd_24322.snap)
/dev/loop3: [2065]:282883 (/code/qemu-1.img)
* Finished execution of grml-debootstrap. Enjoy your Debian system.
At least in GitHub Actions the cleanup of the loop device doesn't seem to work properly.
Also, modprobe loop is failing, as I mentioned in https://github.com/grml/grml-debootstrap/pull/248#issuecomment-1817382866 - is this the same issue or a separate one?
Separate issue, I'd think. The loop device generally works there.
Got any (CI) log where this can be seen?
Maybe a GitHub Actions upstream bug?
Do you think you could come up with minimal code for reproduction? Then this could be reported to GitHub Actions.
Here:
https://github.com/grml/grml-debootstrap/actions/runs/6922515270/job/18829335284?pr=250#step:5:35
I don't fully understand that code. However, to report this bug to GitHub Actions we'd need a tiny script, as minimal and simple as possible: preferably not using Docker, and certainly not mentioning grml-debootstrap.
qemu-img, parted, kpartx, losetup, mount... What are the minimal steps required to reproduce this on GitHub CI?
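Something like this might be a minimal starting point (a sketch only; sizes and paths are made up, untested):

```sh
#!/bin/sh
# Hypothetical minimal reproducer: create an image, partition it on a
# loop device, tear everything down, then check what is left behind.
set -x
qemu-img create -f raw /tmp/repro.img 100M
LOOP=$(losetup -f --show /tmp/repro.img)   # attach to first free loop device
parted -s "$LOOP" mklabel msdos mkpart primary ext4 1MiB 100%
kpartx -a -v "$LOOP"                       # create /dev/mapper/loopXp1
kpartx -d -v "$LOOP"                       # remove the partition mapping again
losetup -d "$LOOP"                         # detach the loop device
losetup -a                                 # expect: /tmp/repro.img no longer listed
```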
Maybe there's already an open bug report: https://github.com/actions/runner/issues
Maybe not a GitHub Actions bug.
Here people had similar issues:
- https://unix.stackexchange.com/questions/342463/how-to-mount-multiple-partitions-from-disk-image-simultaneously
- https://forums.raspberrypi.com/viewtopic.php?t=190154
Someone indicated that using losetup with -P / --partscan might help:
-P, --partscan
Force the kernel to scan the partition table on a newly created loop device. Note that the partition table parsing depends on sector sizes. The default sector size is 512 bytes; otherwise you need to use the option --sector-size together with --partscan.
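If that works, the kpartx step could be dropped entirely; a sketch of the idea (image path taken from the log above, untested):

```sh
# Attach with kernel-side partition scanning instead of kpartx:
LOOP=$(losetup -f --show -P /code/qemu-1.img)
ls "${LOOP}"*            # e.g. /dev/loop3 and /dev/loop3p1 appear directly
mount "${LOOP}p1" /mnt   # no /dev/mapper entries involved
# Teardown is then a single detach, with no kpartx -d step:
umount /mnt
losetup -d "$LOOP"
```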
A more important takeaway might be that one cannot (easily) mount the "same" image twice. Does your code attempt to mount both images at the same time?
It's not the same file, but the images created by your scripts might look confusingly similar to the Linux loop/mount tooling.
Here is how others fixed a similar issue by using mount with sizelimit, but I think this might not be applicable here:
https://github.com/ryankurte/docker-rpi-emu/commit/a66a9667bdf0745379e2fbe221ecbed309669441
Would it be an option for you to modify your PR to mount only 1 image at a time to work around this bug?
From the above forum topic, a user suggested:
You don't need to create a loop device; using the "loop" parameter in the mount command suffices.
mount -o loop,offset=$((98304*512)),sizelimit=1753219072 /srv/raspi/current/2019-04-08-raspbian-stretch-lite.img /mnt
Not sure whether grml-debootstrap could do something similar, i.e. avoid kpartx / losetup. Using offset might be more complicated and error-prone.
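For completeness, here is a sketch of deriving the offset from parted's machine-readable output instead of hardcoding it (untested, and only handles the first partition):

```sh
IMG=/code/qemu-1.img
# "parted -m ... unit B print" prints one colon-separated line per partition;
# field 1 is the partition number, field 2 its start offset in bytes (e.g. "1048576B").
OFFSET=$(parted -s -m "$IMG" unit B print | awk -F: '$1 == "1" { sub(/B$/, "", $2); print $2 }')
mount -o loop,offset="$OFFSET" "$IMG" /mnt
```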
No, the problem here is like this:
- grml-debootstrap puts the img file onto a loop device, so it can modify the partitions in the image. And it really wants the loop device with partitions, so it can modify the EFI partition and the root filesystem, and delegate placement of everything to fdisk etc.
- When grml-debootstrap is done, the image should not be attached to a loop device. This fails for unknown reasons.
- Later the CI scripts try to mount the image again, and this "obviously" fails because step 2 failed.
If grml-debootstrap weren't a shell script I'd try replacing losetup/(k)partx/... with syscalls, but alas...
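Short of syscalls, the shell-level teardown could at least verify itself instead of assuming success; a hedged sketch (device and path taken from the log above):

```sh
# Hypothetical defensive teardown: fail loudly if the image stays attached.
IMG=/code/qemu-1.img
LOOP=/dev/loop3                        # whatever losetup attached earlier
kpartx -d -v "$LOOP" || echo "kpartx -d failed" >&2
dmsetup ls                             # expect: no loop3p1 mapping left
losetup -d "$LOOP" || echo "losetup -d failed" >&2
# losetup -j lists loop devices still backed by the file; empty output means clean.
if losetup -j "$IMG" | grep -q .; then
    echo "ERROR: $IMG is still attached:" >&2
    losetup -j "$IMG" >&2
    exit 1
fi
```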
Syscalls might help with debugging and finding out what the issue is, but generally I think it's better to stick with the standard Linux command-line tools.
There was a mysterious kpartx issue in the past that might still not be fully / cleanly fixed: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=734794
If there's anything similar, it would be good to get that reported upstream.
Are you sure about the offset? I don't know where the number 4194304 is coming from.
Maybe replace the mount using offset with the usual way of doing this?
Could you add additional debug output please?
- Always use kpartx with -v.
- Always use losetup with -v.
- Always use dmsetup with -v.
- Run mount before and after.
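I.e. roughly this around the teardown (a sketch; the exact call sites are up to you):

```sh
mount | grep -e loop -e mapper   # state before teardown
kpartx -d -v "$LOOP"             # verbose removal of partition mappings
losetup -v -d "$LOOP"            # verbose detach
dmsetup -v ls                    # list remaining device-mapper nodes
mount | grep -e loop -e mapper   # state after teardown
```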
There was a mysterious kpartx issue in the past that might still not be fully / cleanly fixed.
Yeah, I was generally thinking we could switch from kpartx to partx, as that's in util-linux. But I haven't investigated this option.
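Untested, but the swap might look roughly like this:

```sh
# Hypothetical kpartx -> partx switch: partx tells the kernel about the
# partitions directly, so no device-mapper nodes are involved.
LOOP=$(losetup -f --show /code/qemu-1.img)
partx -a -v "$LOOP"    # partitions appear as /dev/loopXpN
# ... image setup happens here ...
partx -d -v "$LOOP"    # remove the kernel partition devices again
losetup -d "$LOOP"
```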
Are you sure about the offset? I don't know where the number 4194304 is coming from.
The offset is correct for the specific configuration tested, but this is exactly why I don't want to deal with offsets. (k)partx does this calculation, and I don't want to write code for parsing partition tables... (The comment above the number explains where it comes from.)
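(For what it's worth, 4194304 = 8192 × 512 = 4 MiB, so presumably the first partition starts at the 4 MiB boundary, assuming 512-byte sectors.)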
https://github.com/grml/grml-debootstrap/actions/runs/7172550946/job/19529980137?pr=250#step:4:3166
This is from a run with more -v. You can see how kpartx -d apparently did nothing.
./tests/docker-test-b2b.sh: line 19: dmsetup: command not found
./tests/docker-test-b2b.sh: line 19: dmsetup: command not found
Sure, but this is a long time after the problem occurred.