kairos
debian: Unable to boot Kairos installer
Kairos version:
Fails to boot on kairos-debian-bookworm-standard-amd64-generic-v3.0.8-k3sv1.29.3+k3s1, success on kairos-debian-bookworm-standard-amd64-generic-v3.0.0-k3sv1.29.0+k3s1
CPU architecture, OS, and Version:
Linux localhost 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
(output from v3.0.0)
Describe the bug
The Kairos ISO is unable to boot and I'm unable to install Kairos (manually and automatically).
To Reproduce
Try to install Kairos on the latest version, not sure if this is reproducible. This is running in a KVM VM.
Expected behavior
I should be able to boot into the Kairos install ISO.
Logs
Additional context
This bug looks exactly like #2467, but trying the fix there and adding that to a Dockerfile doesn't resolve the issue.
Hello, 6ixfalls! I'm an automated bot assisting with Github issue audits in the kairos project. I've added the 'question' label to your issue (#2522) because it appears we need more information to properly investigate your report.
To enhance our understanding and help us better address your problem, please provide:
- A clear description of the issue you're experiencing, including any error messages you receive.
- Steps to reproduce the problem, if possible.
- The versions of all relevant artifacts you're using, such as the Kairos version, CPU architecture, OS version, and any specific configurations or dependencies.
Please ensure that your description, steps to reproduce, and version details are explicitly mentioned in your issue. We appreciate your efforts to help us improve Kairos, and don't hesitate to reach out if you have any questions. Note that I am a bot, an experiment of @mudler and @jimmykarily.
Thanks! kairos-io Githubbot
This could be related, but I'm using a custom docker image with auroraboot to generate an ISO. The Dockerfile is here: https://github.com/6ixfalls/taonet-cloud/blob/main/containers/kairos-debian/Dockerfile
It also appears this issue was introduced between 3.0.0 and 3.0.3 - this appears to be a fix to the issue: https://github.com/tyzbit/kairos-distros/commit/e11addab610b5e01f2c81c6610b62841fbf1a20f
A note: that was an attempted fix. It didn't fix it for me on 3.0.3 but I didn't try other versions.
Maybe this is relevant? https://github.com/kairos-io/packages/blob/718aaa27e4688559433cd889513f1944a7679ef4/packages/static/kairos-overlay-files/files/system/oem/12_nvidia.yaml#L10
oh wait, you are not on nvidia. On the other hand, maybe that module needs to be included somehow (?).
maybe not that irrelevant after all: https://forums.fedoraforum.org/showthread.php?325865-dracut-FATAL-iscsi-requested-but-kernel-initrd-does-not-support-iscsi
you could try to omit iscsi in dracut to see if this helps
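For anyone who wants to try that, here is a minimal sketch of omitting the iscsi module through a dracut drop-in. It assumes the image reads drop-ins from /etc/dracut.conf.d; the file name 99-omit-iscsi.conf and the function name are hypothetical, and this is only meant as a diagnostic step, not a proposed fix.

```shell
#!/bin/sh
# Sketch: write a dracut drop-in that omits the iscsi dracut module.
# Assumption: the file name 99-omit-iscsi.conf is an arbitrary, hypothetical choice.
write_omit_iscsi_conf() {
    conf_dir="${1:-/etc/dracut.conf.d}"
    mkdir -p "$conf_dir"
    # Keep the leading and trailing spaces inside the quotes, since the
    # value is appended to a space-separated list.
    printf '%s\n' 'omit_dracutmodules+=" iscsi "' > "$conf_dir/99-omit-iscsi.conf"
}

# Possible usage in an image build, followed by regenerating the initramfs:
#   write_omit_iscsi_conf
#   dracut -f "/boot/initrd-${kernel}" "${kernel}"
```

The initramfs has to be regenerated after writing the file, otherwise the drop-in has no effect on the image that actually boots.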
I'm not sure if this is how to correctly do it, but I tried this configuration and it did not fix the issue.
If this does what I suspect, this would break compatibility with at least Longhorn. Can we see what it takes for the kernel to support iscsi?
It looks like we need to disable iscsi as we do already for nvidia: https://github.com/kairos-io/kairos/blob/f5c105009a4df27ee3843bc49167eebc29f19bc7/images/Dockerfile.nvidia#L101
It looks like the iscsi modules are not properly set up in the initramfs: the dracut failure indicates that it is checking for the iscsi_tcp module to be available.
You could try to install iscsiuio alongside and regenerate the initramfs, as that seems to bring in the iscsi_tcp module needed by dracut.
I'm going to try a quick test here, but I can already see that once that package is installed, the modules are available and iscsi is added to the dracut modules.
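A rough way to check this inside a built image is to look for the iscsi_tcp module file under each installed kernel. This is only a sketch: the function names are made up for illustration, and it assumes modules live under /lib/modules/<kernel>/ as .ko files (possibly compressed).

```shell
#!/bin/sh
# has_kernel_module MODULES_ROOT KERNEL MODULE
# Succeeds if a file named MODULE.ko (or a compressed variant such as
# MODULE.ko.xz) exists anywhere under MODULES_ROOT/KERNEL.
has_kernel_module() {
    [ -n "$(find "$1/$2" -name "$3.ko*" 2>/dev/null | head -n1)" ]
}

# report_iscsi_modules [MODULES_ROOT]
# Prints one line per installed kernel saying whether iscsi_tcp is there.
report_iscsi_modules() {
    root="${1:-/lib/modules}"
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        kernel=$(basename "$dir")
        if has_kernel_module "$root" "$kernel" iscsi_tcp; then
            echo "$kernel: iscsi_tcp present"
        else
            echo "$kernel: iscsi_tcp MISSING"
        fi
    done
}

# To see what actually ended up in the initramfs, dracut ships lsinitrd:
#   lsinitrd /boot/initrd | grep -i iscsi
```

If more than one kernel shows up in the report, it is worth noting which of them the initramfs was generated for.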
what cmdline are you using?
with a quick patch to install the package alongside Kairos and letting dracut regenerate the initramfs the proper module is available and loaded:
I can confirm that customizing the Debian image (only tested this one) from v3.0.0 and up produces the "iscsi error" for dracut. I followed this doc https://kairos.io/docs/advanced/customizing/ at first. Then I used this docker file (https://github.com/kairos-io/kairos/blob/master/images/Dockerfile.kairos-debian) to rebuild the image from scratch while adding packages I needed. Still the iscsi error from dracut appeared. After that I added the "iscsiuio" package and net booting with Aurora worked... the first time.
The second time I launched Aurora and tried to net boot the server, it gave me the same error. I inspected the temp directory to which Auroraboot extracts the ISO; its /netboot directory contains all the net boot artifacts. I inspected the kernel file there and compared it to the kernel files in the ISO (which are unpacked in the temp directory).
I found that the net boot kernel (kairos-kernel) was the oldest kernel file and not the most recent, which is why it did not contain the iscsi module of which dracut complains it is not present in the kernel during net boot. I copied the latest kernel and used the other artifacts in /tmp/netboot to start pixiecore and everything worked as expected.
It looks like Auroraboot is picking the wrong kernel (sometimes) for booting, can you confirm?
Let's install iscsiuio by default (all flavors?) so that it makes it to the initramfs.
I tried that and it did not seem to help: https://github.com/tyzbit/kairos-distros/commit/e11addab610b5e01f2c81c6610b62841fbf1a20f. It strongly seems to be an AuroraBoot issue.
let's try to replicate in auroraboot and see if we can detect what the issue actually is
Check which kernel AuroraBoot is using in /tmp/netboot
In my case the errors persisted because an older kernel was used, instead of the latest that had the supporting iscsi modules.
I copied the latest kernel from the temp directory (the unpacked ISO) and replaced the kernel file and all worked fine.
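A small sketch of that check, for anyone who wants to compare the kernel AuroraBoot serves with the one in the unpacked ISO. The netboot path comes from the comments above; the ISO-side path is a placeholder because it depends on where the ISO was unpacked, and the function name is made up.

```shell
#!/bin/sh
# check_netboot_kernel NETBOOT_KERNEL ISO_KERNEL
# Compares the two files byte by byte and reports whether the kernel that
# would be served over netboot is identical to the one taken from the ISO.
check_netboot_kernel() {
    if cmp -s "$1" "$2"; then
        echo "netboot kernel matches the ISO kernel"
    else
        echo "netboot kernel DIFFERS from the ISO kernel"
        return 1
    fi
}

# Example (the second path is a placeholder, not a real location):
#   check_netboot_kernel /tmp/netboot/kairos-kernel /path/to/unpacked-iso/kernel
```

A mismatch would support the stale-kernel theory; a match would point back at the image contents instead.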
- I created a patch in kairos:
~/workspace/kairos/kairos (master)*$ git diff
diff --git a/images/Dockerfile.debian b/images/Dockerfile.debian
index 39d94482..07862509 100644
--- a/images/Dockerfile.debian
+++ b/images/Dockerfile.debian
@@ -64,6 +64,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
iputils-ping \
isc-dhcp-common \
isc-dhcp-client \
+ iscsiuio \
jq \
krb5-locales \
less \
@@ -162,4 +163,4 @@ RUN systemctl enable systemd-networkd
RUN systemctl enable ssh
# Fixup sudo perms
-RUN chown root:root /usr/bin/sudo && chmod 4755 /usr/bin/sudo
\ No newline at end of file
+RUN chown root:root /usr/bin/sudo && chmod 4755 /usr/bin/sudo
diff --git a/images/Dockerfile.kairos-debian b/images/Dockerfile.kairos-debian
index 60c85c1d..3391363c 100644
--- a/images/Dockerfile.kairos-debian
+++ b/images/Dockerfile.kairos-debian
@@ -63,6 +63,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
iputils-ping \
isc-dhcp-common \
isc-dhcp-client \
+ iscsiuio \
jq \
krb5-locales \
less \
- I built a container image:
earthly +base-image --VARIANT=core --FLAVOR=debian --FLAVOR_RELEASE=bookworm-slim --BASE_IMAGE=debian:bookworm-slim --MODEL=generic --FAMILY=debian
- I started Auroraboot with the result image:
docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock --net host quay.io/kairos/auroraboot --set "container_image=docker://quay.io/kairos/debian:bookworm-slim-core-amd64-generic-v3.0.4-73-g8ddb9092-dirty"
- I started a VM in netboot mode (with virt-manager)
It successfully boots debian.
Since the docker command I used to run Auroraboot didn't mount any volumes, it's not possible to have cached any data between runs. @tyzbit how are you running Auroraboot? @athnoc-dev suggestion makes me think that some people might be using some command (from our docs?) that is mounting a volume and caches things. Is that the case?
This is true in my case - I use auroraboot to generate ISOs to upload to my Kairos nodes, and as a result I have a mount so that I can access the completed ISO. I don't think it should be expected behavior for auroraboot to not generate a new kernel if there's an existing one present - but I'm also not sure if reusing the same directory for building has any effect on the speed of the builds themselves either.
I'm actually not too sure if this is a kernel issue, because as far as I remember this issue occurs with a fresh auroraboot install. However, another thing that appears to be common among everyone who has the issue is that the Kairos Dockerfile is modified (is it possible that the Github Action caching the Docker buildsteps leads to this issue?)
I tried provisioning from AuroraBoot where the storage area was ephemeral (and thus it was not possible anything was cached) and I ran into the same issue.
I have tested this on Debian on v3.1.0 and the issue persists. I'm not able to do any upgrades of Kairos or k3s/k8s as a result.
Same problem here with that:
docker run --rm -ti -v /tmp/temp-rootfs:/tmp/temp-rootfs -v "$PWD"/config.yaml:/config.yaml --net host quay.io/kairos/auroraboot \
  --set "container_image=quay.io/kairos/debian:bookworm-standard-amd64-generic-v3.1.1-k3sv1.28.9-k3s1" --cloud-config /config.yaml
[ OK ] Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
[ OK ] Finished systemd-tmpfiles-setup.service - Create Volatile Files and Directories.
[7.842662] dracut: FATAL: iscsiroot requested but kernel/initrd does not support iscsi
[7.843471] dracut: Refusing to continue
[7.875072] systemd-shutdown[1]: Syncing filesystems and block devices.
[7.876832] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[7.879261] systemd-journald[210]: Received SIGTERM from PID 1 (systemd-shutdow).
[7.880814] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[7.883059] systemd-shutdown[1]: Unmounting file systems.
[7.884469] (sd-umount)[349]: Unmounting '/run/credentials/systemd-tmpfiles-setup.service'.
[7.886024] (sd-umount)[350]: Unmounting '/run/credentials/systemd-tmpfiles-setup-dev.service'.
[7.887561] (sd-umount)[351]: Unmounting '/run/credentials/systemd-sysctl.service'.
[7.889076] (sd-umount)[352]: Unmounting '/run/credentials/systemd-sysusers.service'.
[7.890581] (sd-remount)[353]: Remounting '/' read-only with options 'ro'.
[7.891765] systemd-shutdown[1]: All filesystems unmounted.
[7.892697] systemd-shutdown[1]: Deactivating swaps.
[7.893512] systemd-shutdown[1]: All swaps deactivated.
[7.894312] systemd-shutdown[1]: Detaching loop devices.
[7.895201] systemd-shutdown[1]: All loop devices detached.
[7.896000] systemd-shutdown[1]: Stopping MD devices.
[7.896944] systemd-shutdown[1]: All MD devices stopped.
[7.897726] systemd-shutdown[1]: Detaching DM devices.
[7.898556] systemd-shutdown[1]: All DM devices detached.
[7.899336] systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[7.901093] systemd-shutdown[1]: Syncing filesystems and block devices.
[7.901890] systemd-shutdown[1]: Halting system.
[7.941757] reboot: System halted
Same for the ubuntu:22.04 image:
docker run --rm -ti -v /tmp/temp-rootfs:/tmp/temp-rootfs -v "$PWD"/config.yaml:/config.yaml --net host quay.io/kairos/auroraboot \
  --set "container_image=quay.io/kairos/ubuntu:22.04-standard-amd64-generic-v3.0.14-k3sv1.28.5-k3s1" --cloud-config /config.yaml
Thanks for your help.
I took your Dockerfile and removed the parts that don't work for me:
FROM quay.io/kairos/debian:bookworm-standard-amd64-generic-v3.1.1-k3sv1.30.2-k3s1
#COPY rootfs/ /
RUN apt-get update && \
apt-get install -y \
bc=1.07.* \
bluetooth=5.66-* \
dbus-broker=33-* \
# for i915
fancontrol=1:3* \
htop=3.2.2* \
iotop=0.6-* \
nethogs=0.8.7* \
iscsiuio=2.1.8-1 \
smartmontools=7.3-* \
usbutils=1:014-* \
wget=1.21.* \
&& \
apt-get clean && rm -rf /var/lib/apt/lists/* && \
echo "TYZBIT_HOME_URL=https://github.com/tyzbit/kairos-distros" >> /etc/os-release && \
echo "TYZBIT_VARIANT=debian" >> /etc/os-release && \
systemctl enable dbus-broker.service
# Update kernel modules
RUN kernel=$(ls /lib/modules | head -n1) && \
depmod -a "${kernel}" && \
dracut -f "/boot/initrd-${kernel}" "${kernel}" && \
ln -sf "initrd-${kernel}" /boot/initrd
(I removed the firmware packages that weren't available and the COPY rootfs/ / part, because I don't have that directory and I don't know what it contains. Maybe repository configuration for the missing packages?)
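One thing that might be worth ruling out in the "Update kernel modules" step of the Dockerfile: `ls /lib/modules | head -n1` picks the lexically first directory, which is not necessarily the newest kernel if more than one is installed. Whether these images ever contain more than one kernel is an assumption I have not verified. A sketch of a version-aware pick, assuming a `sort` that supports `-V` (GNU coreutils does); the function name is made up:

```shell
#!/bin/sh
# latest_kernel [MODULES_ROOT]
# Prints the highest kernel version directory under MODULES_ROOT,
# using version-aware sorting instead of plain lexical order.
latest_kernel() {
    ls "${1:-/lib/modules}" | sort -V | tail -n1
}

# Possible use in the Dockerfile RUN step instead of `head -n1`:
#   kernel=$(latest_kernel) && \
#   depmod -a "${kernel}" && \
#   dracut -f "/boot/initrd-${kernel}" "${kernel}"
```

With 6.1.0-18-amd64 and 6.1.0-21-amd64 both present, `head -n1` would return the older 6.1.0-18-amd64, while the version sort returns 6.1.0-21-amd64.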
I built an image out of it with: docker build -t tyzbit-debian-image .
and I started auroraboot with:
docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock --net host quay.io/kairos/auroraboot --set "container_image=docker://tyzbit-debian-image"
Then I started qemu, configured to boot from network first. This is the output of the auroraboot container:
8:04AM INF Pulling container image 'tyzbit-debian-image' to '/tmp/temp-rootfs' (local: true)
8:04AM INF Generating iso 'kairos' from '/tmp/temp-rootfs' to '/tmp/build'
8:06AM INF Extracting netboot artifacts 'kairos' from '/tmp/build/kairos.iso' to '/tmp/netboot'
8:06AM INF Listening on :8080...
8:06AM INF Start pixiecore
2024/09/05 08:12:29 DHCP: Offering to boot 52:54:00:0b:1e:22
2024/09/05 08:12:31 DHCP: Offering to boot 52:54:00:0b:1e:22
2024/09/05 08:12:32 TFTP: Sent "52:54:00:0b:1e:22/4" to 192.168.122.163:27833
2024/09/05 08:12:33 DHCP: Offering to boot 52:54:00:0b:1e:22
2024/09/05 08:12:33 HTTP: Sending ipxe boot script to 192.168.122.163:1025
2024/09/05 08:12:33 HTTP: Sent file "kernel" to 192.168.122.163:1025
2024/09/05 08:12:33 HTTP: Sent file "initrd-0" to 192.168.122.163:1025
2024/09/05 08:12:43 HTTP: Sent file "other-0" to 192.168.122.163:50414
2024/09/05 08:12:57 HTTP: Sent file "other-1" to 192.168.122.163:46738
and the VM boots just fine :shrug: .
Maybe what you have in the rootfs directory is making it fail for you somehow?
I commented out the iscsiuio=2.1.8-1 \ line in the Dockerfile to make sure it's the actual fix. It turns out the VM boots fine even without it. I can't recall if we ever managed to reproduce the issue, and looking at the comments above, we only seem to have tried with the possible "fix", not without it.
So we are one step backwards now, we need to reproduce the issue before we can say we have a fix. The question is how? Are there any specific steps that we can take to make this happen locally using qemu?
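To make a local reproduction loop give a clear pass/fail, one option is to capture the VM's serial console to a file and scan it for the dracut message reported above. This is a sketch only: it assumes the console output is being written to a log file (for example through QEMU's `-serial file:console.log` together with a serial console on the kernel command line), and the function name is made up.

```shell
#!/bin/sh
# boot_failed_on_iscsi CONSOLE_LOG
# Succeeds if the captured console log contains the dracut failure
# quoted in this issue.
boot_failed_on_iscsi() {
    grep -q 'dracut: FATAL: iscsiroot requested but kernel/initrd does not support iscsi' "$1"
}

# Example:
#   if boot_failed_on_iscsi console.log; then
#       echo "reproduced the iscsi boot failure"
#   fi
```

Running this once per image variant (with and without iscsiuio, with and without a mounted /tmp/temp-rootfs) would show which combination actually triggers the failure.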