VM loop device not cleaned up in CI
* Removing loopback mount of file /code/qemu-1.img.
previous state:
loop3p1 (253:0)
/dev/loop1: [2065]:72205 (/var/lib/snapd/snaps/core20_2015.snap)
/dev/loop2: [2065]:72206 (/var/lib/snapd/snaps/snapd_20290.snap)
/dev/loop0: [2065]:72204 (/var/lib/snapd/snaps/lxd_24322.snap)
/dev/loop3: [2065]:282883 (/code/qemu-1.img)
after kpartx-d
loop3p1 (253:0)
/dev/loop1: [2065]:72205 (/var/lib/snapd/snaps/core20_2015.snap)
/dev/loop2: [2065]:72206 (/var/lib/snapd/snaps/snapd_20290.snap)
/dev/loop0: [2065]:72204 (/var/lib/snapd/snaps/lxd_24322.snap)
/dev/loop3: [2065]:282883 (/code/qemu-1.img)
loop_part is: loop3p1
loop3p1 (253:0)
/dev/loop1: [2065]:72205 (/var/lib/snapd/snaps/core20_2015.snap)
/dev/loop2: [2065]:72206 (/var/lib/snapd/snaps/snapd_20290.snap)
/dev/loop0: [2065]:72204 (/var/lib/snapd/snaps/lxd_24322.snap)
/dev/loop3: [2065]:282883 (/code/qemu-1.img)
* Finished execution of grml-debootstrap. Enjoy your Debian system.
At least in GitHub Actions the cleanup of the loop device doesn't seem to work properly.
Also, modprobe loop is failing, as I mentioned in https://github.com/grml/grml-debootstrap/pull/248#issuecomment-1817382866 - is this the same issue or a separate one?
Separate issue, I'd think. The loop device generally works there.
Got any (CI) log where this can be seen?
Maybe a GitHub Actions upstream bug?
Do you think you could come up with minimal code for reproduction? Then this could be reported to GitHub Actions.
Here:
https://github.com/grml/grml-debootstrap/actions/runs/6922515270/job/18829335284?pr=250#step:5:35
I don't fully understand that code. However, to report this bug to GitHub Actions we'd need a tiny script, as minimal and simple as possible: preferably not using Docker, and certainly not mentioning grml-debootstrap.
qemu-img, parted, kpartx, losetup, mount... What are the minimal steps required to reproduce this on GitHub CI?
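Something like this might be a minimal starting point (a sketch only; sizes and paths are made up, untested):

```sh
#!/bin/sh
# Hypothetical minimal reproducer: create an image, partition it on a
# loop device, tear everything down, then check what is left behind.
set -x
qemu-img create -f raw /tmp/repro.img 100M
LOOP=$(losetup -f --show /tmp/repro.img)   # attach to first free loop device
parted -s "$LOOP" mklabel msdos mkpart primary ext4 1MiB 100%
kpartx -a -v "$LOOP"                       # create /dev/mapper/loopXp1
kpartx -d -v "$LOOP"                       # remove the partition mapping again
losetup -d "$LOOP"                         # detach the loop device
losetup -a                                 # expect: /tmp/repro.img no longer listed
```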
Maybe there's already an open bug report: https://github.com/actions/runner/issues
Maybe not a GitHub Actions bug.
Here people had similar issues:
- https://unix.stackexchange.com/questions/342463/how-to-mount-multiple-partitions-from-disk-image-simultaneously
- https://forums.raspberrypi.com/viewtopic.php?t=190154
Someone indicated that using losetup with -P / --partscan might help:
-P, --partscan
Force the kernel to scan the partition table on a newly created loop device. Note that the partition table parsing depends on sector sizes. The default sector size is 512 bytes; otherwise you need to use the option --sector-size together with --partscan.
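If that works, the kpartx step could be dropped entirely; a sketch of the idea (image path taken from the log above, untested):

```sh
# Attach with kernel-side partition scanning instead of kpartx:
LOOP=$(losetup -f --show -P /code/qemu-1.img)
ls "${LOOP}"*            # e.g. /dev/loop3 and /dev/loop3p1 appear directly
mount "${LOOP}p1" /mnt   # no /dev/mapper entries involved
# Teardown is then a single detach, with no kpartx -d step:
umount /mnt
losetup -d "$LOOP"
```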
A more important takeaway might be that one cannot (easily) mount the "same" image twice. Does your code attempt to mount both images at the same time?
It's not the same file, but the images created by your scripts might look confusingly similar to the Linux loop/mount tooling.
Here is how others fixed a similar issue by using mount with sizelimit, but I think this might not be applicable here:
https://github.com/ryankurte/docker-rpi-emu/commit/a66a9667bdf0745379e2fbe221ecbed309669441
Would it be an option for you to modify your PR to mount only 1 image at a time to work around this bug?
From the above forum topic, a user suggested:
You don't need to create a loop device; using the "loop" parameter in the mount command suffices.
mount -o loop,offset=$((98304*512)),sizelimit=1753219072 /srv/raspi/current/2019-04-08-raspbian-stretch-lite.img /mnt
Not sure whether grml-debootstrap could do something similar, i.e. avoid kpartx / losetup. Using offset might be more complicated and error-prone.
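For completeness, here is a sketch of deriving the offset from parted's machine-readable output instead of hardcoding it (untested, and only handles the first partition):

```sh
IMG=/code/qemu-1.img
# "parted -m ... unit B print" prints one colon-separated line per partition;
# field 1 is the partition number, field 2 its start offset in bytes (e.g. "1048576B").
OFFSET=$(parted -s -m "$IMG" unit B print | awk -F: '$1 == "1" { sub(/B$/, "", $2); print $2 }')
mount -o loop,offset="$OFFSET" "$IMG" /mnt
```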
No, the problem here is like this:
- grml-debootstrap puts the img file onto a loop device, so it can modify the partitions in the image. And it really wants the loop device with partitions, so it can modify the EFI partition and the root filesystem, and delegate placement of everything to fdisk etc.
- When grml-debootstrap is done, the image should not be attached to a loop device. This fails for unknown reasons.
- Later the CI scripts try to mount the image again, and this "obviously" fails because step 2 failed.
If grml-debootstrap weren't a shell script I'd try replacing losetup/(k)partx/... with syscalls, but alas...
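Short of syscalls, the shell-level teardown could at least verify itself instead of assuming success; a hedged sketch (device and path taken from the log above):

```sh
# Hypothetical defensive teardown: fail loudly if the image stays attached.
IMG=/code/qemu-1.img
LOOP=/dev/loop3                        # whatever losetup attached earlier
kpartx -d -v "$LOOP" || echo "kpartx -d failed" >&2
dmsetup ls                             # expect: no loop3p1 mapping left
losetup -d "$LOOP" || echo "losetup -d failed" >&2
# losetup -j lists loop devices still backed by the file; empty output means clean.
if losetup -j "$IMG" | grep -q .; then
    echo "ERROR: $IMG is still attached:" >&2
    losetup -j "$IMG" >&2
    exit 1
fi
```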
Syscalls might help with debugging and finding out what the issue is, but generally I think it's better to stick with the standard Linux command-line tools.
There was a mysterious kpartx issue in the past that might still not be fully / cleanly fixed: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=734794
If there's anything similar, it would be good to get that reported upstream.
Are you sure about the offset? I don't know where the number 4194304 is coming from.
Maybe replace the mount using offset with the usual way of doing this?
Could you add additional debug output please?
- Always use kpartx with -v.
- Always use losetup with -v.
- Always use dmsetup with -v.
- Run mount before and after.
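I.e. roughly this around the teardown (a sketch; the exact call sites are up to you):

```sh
mount | grep -e loop -e mapper   # state before teardown
kpartx -d -v "$LOOP"             # verbose removal of partition mappings
losetup -v -d "$LOOP"            # verbose detach
dmsetup -v ls                    # list remaining device-mapper nodes
mount | grep -e loop -e mapper   # state after teardown
```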
There was a mysterious kpartx issue in the past that might still not be fully / cleanly fixed.
Yeah, I was generally thinking we could switch from kpartx to partx, as that's in util-linux. But I haven't investigated this option.
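Untested, but the swap might look roughly like this:

```sh
# Hypothetical kpartx -> partx switch: partx tells the kernel about the
# partitions directly, so no device-mapper nodes are involved.
LOOP=$(losetup -f --show /code/qemu-1.img)
partx -a -v "$LOOP"    # partitions appear as /dev/loopXpN
# ... image setup happens here ...
partx -d -v "$LOOP"    # remove the kernel partition devices again
losetup -d "$LOOP"
```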
Are you sure about the offset? I don't know where the number 4194304 is coming from.
The offset is correct for the specific configuration tested, but this is exactly why I don't want to deal with offsets. (k)partx does this calculation, and I don't want to write code for parsing partition tables... (The comment above the number explains where it comes from.)
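(For what it's worth, 4194304 = 8192 × 512 = 4 MiB, so presumably the first partition starts at the 4 MiB boundary, assuming 512-byte sectors.)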
https://github.com/grml/grml-debootstrap/actions/runs/7172550946/job/19529980137?pr=250#step:4:3166
This is from a run with more -v. You can see how kpartx -d apparently did nothing.
./tests/docker-test-b2b.sh: line 19: dmsetup: command not found
./tests/docker-test-b2b.sh: line 19: dmsetup: command not found
Sure, but this is a long time after the problem occurred.