debian: Unable to boot Kairos installer

Open 6ixfalls opened this issue 9 months ago • 33 comments

Kairos version:

Fails to boot on kairos-debian-bookworm-standard-amd64-generic-v3.0.8-k3sv1.29.3+k3s1, success on kairos-debian-bookworm-standard-amd64-generic-v3.0.0-k3sv1.29.0+k3s1

CPU architecture, OS, and Version:

Linux localhost 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

(output from v3.0.0)

Describe the bug

The Kairos ISO is unable to boot and I'm unable to install Kairos (manually and automatically).

To Reproduce

Try to install Kairos on the latest version, not sure if this is reproducible. This is running in a KVM VM.

Expected behavior

I should be able to boot into the Kairos install ISO.

Logs

image

Additional context

This bug looks exactly like #2467, but trying the fix there and adding that to a Dockerfile doesn't resolve the issue.

6ixfalls avatar Apr 30 '24 04:04 6ixfalls

Hello, 6ixfalls! I'm an automated bot assisting with Github issue audits in the kairos project. I've added the 'question' label to your issue (#2522) because it appears we need more information to properly investigate your report.

To enhance our understanding and help us better address your problem, please provide:

  • A clear description of the issue you're experiencing, including any error messages you receive.
  • Steps to reproduce the problem, if possible.
  • The versions of all relevant artifacts you're using, such as the Kairos version, CPU architecture, OS version, and any specific configurations or dependencies.

Please ensure that your description, steps to reproduce, and version details are explicitly mentioned in your issue. We appreciate your efforts to help us improve Kairos, and don't hesitate to reach out if you have any questions. Note that I am a bot, an experiment of @mudler and @jimmykarily.

Thanks! kairos-io Githubbot

ci-robbot avatar Apr 30 '24 04:04 ci-robbot

This could be related, but I'm using a custom docker image with auroraboot to generate an ISO. The Dockerfile is here: https://github.com/6ixfalls/taonet-cloud/blob/main/containers/kairos-debian/Dockerfile

It also appears this issue was introduced between 3.0.0 and 3.0.3 - this appears to be a fix to the issue: https://github.com/tyzbit/kairos-distros/commit/e11addab610b5e01f2c81c6610b62841fbf1a20f

6ixfalls avatar Apr 30 '24 18:04 6ixfalls

A note: that was an attempted fix. It didn't fix it for me on 3.0.3 but I didn't try other versions.

tyzbit avatar Apr 30 '24 21:04 tyzbit

Maybe this is relevant? https://github.com/kairos-io/packages/blob/718aaa27e4688559433cd889513f1944a7679ef4/packages/static/kairos-overlay-files/files/system/oem/12_nvidia.yaml#L10

jimmykarily avatar May 01 '24 07:05 jimmykarily

oh wait, you are not on nvidia. On the other hand, maybe that module needs to be included somehow (?).

jimmykarily avatar May 01 '24 07:05 jimmykarily

maybe not that irrelevant after all: https://forums.fedoraforum.org/showthread.php?325865-dracut-FATAL-iscsi-requested-but-kernel-initrd-does-not-support-iscsi

you could try to omit iscsi in dracut to see if this helps
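
A minimal sketch of how omitting the iscsi dracut module might look, assuming a dracut drop-in file is used. The file name `99-omit-iscsi.conf` and the `write_omit_iscsi_conf` helper are made up for illustration, and this is not necessarily the configuration anyone in this thread tested:

```shell
#!/usr/bin/env bash
# Write a dracut drop-in that omits the "iscsi" dracut module.
# conf_dir is a parameter so the function can be tried on a scratch directory;
# on a real system dracut reads drop-ins from /etc/dracut.conf.d.
write_omit_iscsi_conf() {
    local conf_dir="${1:-/etc/dracut.conf.d}"
    mkdir -p "$conf_dir"
    # The spaces inside the quotes are part of the dracut.conf list syntax.
    # The file name below is hypothetical.
    printf 'omit_dracutmodules+=" iscsi "\n' > "$conf_dir/99-omit-iscsi.conf"
}
```

After writing the drop-in, the initramfs would still need to be regenerated (for example with `dracut -f`) before the change could have any effect.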

jimmykarily avatar May 01 '24 07:05 jimmykarily

I'm not sure if this is how to correctly do it, but I tried this configuration and it did not fix the issue.

6ixfalls avatar May 01 '24 07:05 6ixfalls

If this does what I suspect, this would break compatibility with at least Longhorn. Can we see what it takes for the kernel to support iscsi?

tyzbit avatar May 01 '24 14:05 tyzbit

It looks like we need to disable iscsi as we do already for nvidia: https://github.com/kairos-io/kairos/blob/f5c105009a4df27ee3843bc49167eebc29f19bc7/images/Dockerfile.nvidia#L101

mudler avatar May 06 '24 08:05 mudler

It looks like the iscsi modules are not properly set up in the initramfs: the dracut failure indicates that it's checking for the iscsi_tcp module to be available.

You could try to install iscsiuio alongside and regenerate the initramfs as that seems to bring the proper iscsi_tcp module needed by dracut

I'm going to try a quick test here, but I can already see that once that package is installed, the modules are available and iscsi is added to the dracut modules.
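
A rough way to check whether the module is there before regenerating the initramfs, as a sketch only. The `has_iscsi_tcp` helper is hypothetical, and it assumes the module file is named `iscsi_tcp.ko` (possibly with a compression suffix) somewhere under the kernel's modules directory:

```shell
#!/usr/bin/env bash
# Succeeds if an iscsi_tcp module file exists for the given kernel version.
# modules_root defaults to /lib/modules; pass another path to inspect an
# unpacked rootfs instead of the running system.
has_iscsi_tcp() {
    local kernel="$1"
    local modules_root="${2:-/lib/modules}"
    # Modules may be compressed (.ko.xz, .ko.zst, ...), hence the trailing wildcard.
    # -quit (GNU find) stops at the first match; grep turns "any output" into an exit status.
    find "$modules_root/$kernel" -name 'iscsi_tcp.ko*' -print -quit 2>/dev/null | grep -q .
}
```

Inside a container build, `uname -r` reports the host kernel, so the kernel argument should be the directory name found under `/lib/modules` in the image. dracut also ships `lsinitrd`, which can be used to inspect what ended up in a generated initramfs.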

what cmdline are you using?

Itxaka avatar May 07 '24 12:05 Itxaka

with a quick patch to install the package alongside Kairos and letting dracut regenerate the initramfs the proper module is available and loaded:

image

Itxaka avatar May 07 '24 12:05 Itxaka

I can confirm that customizing the Debian image (the only one I tested) from v3.0.0 and up produces the "iscsi error" from dracut. I followed this doc https://kairos.io/docs/advanced/customizing/ at first. Then I used this Dockerfile (https://github.com/kairos-io/kairos/blob/master/images/Dockerfile.kairos-debian) to rebuild the image from scratch while adding the packages I needed. The iscsi error from dracut still appeared. After that I added the "iscsiuio" package and net booting with Aurora worked... the first time.

The second time I launched Aurora and tried to net boot the server, it gave me the same error. I inspected the temp directory to which Auroraboot extracts the ISO; its /netboot directory contains all the net boot artifacts. I compared the kernel file there to the kernel files in the ISO (which are unpacked in the temp directory).

I found that the net boot kernel (kairos-kernel) was the oldest kernel file, not the most recent one. That is why it did not contain the iscsi module that dracut reports as missing during net boot. I copied the latest kernel and used the other artifacts in /tmp/netboot to start pixiecore and everything worked as expected.

It looks like Auroraboot is picking the wrong kernel (sometimes) for booting, can you confirm?

athnoc-dev avatar May 08 '24 08:05 athnoc-dev

Let's install iscsiuio by default (all flavors?) so that it makes it to the initramfs.

jimmykarily avatar May 20 '24 08:05 jimmykarily

I tried that and it did not seem to help: https://github.com/tyzbit/kairos-distros/commit/e11addab610b5e01f2c81c6610b62841fbf1a20f. It does strongly seem to be an AuroraBoot issue.

tyzbit avatar May 20 '24 14:05 tyzbit

let's try to replicate in auroraboot and see if we can detect what the issue actually is

mauromorales avatar May 23 '24 08:05 mauromorales

Check which kernel AuroraBoot is using in /tmp/netboot

In my case the errors persisted because an older kernel was used, instead of the latest that had the supporting iscsi modules.

I copied the latest kernel from the temp directory (the unpacked ISO) and replaced the kernel file and all worked fine.
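
The check described above could be scripted roughly as follows. This is a sketch under assumptions: the `find_matching_kernel` helper is hypothetical, the kernels in the unpacked ISO are assumed to be named `vmlinuz*`, and the path to the unpacked ISO is not given in this thread, so it is left as a parameter:

```shell
#!/usr/bin/env bash
# Print every kernel file in iso_boot_dir that is byte-identical to the
# kernel AuroraBoot extracted for net boot.
find_matching_kernel() {
    local netboot_kernel="$1"
    local iso_boot_dir="$2"
    local candidate
    for candidate in "$iso_boot_dir"/vmlinuz*; do
        # Skip the unexpanded glob when nothing matches.
        [ -f "$candidate" ] || continue
        # cmp -s compares silently and only reports through its exit status.
        if cmp -s "$netboot_kernel" "$candidate"; then
            printf '%s\n' "$candidate"
        fi
    done
}
```

For example, `find_matching_kernel /tmp/netboot/kairos-kernel <unpacked-iso>/boot` would show which ISO kernel the net boot kernel was copied from. If the printed file is not the newest version, that would match the behaviour reported here.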

athnoc-dev avatar May 23 '24 09:05 athnoc-dev

  • I created a patch in kairos:
~/workspace/kairos/kairos (master)*$ git diff
diff --git a/images/Dockerfile.debian b/images/Dockerfile.debian
index 39d94482..07862509 100644
--- a/images/Dockerfile.debian
+++ b/images/Dockerfile.debian
@@ -64,6 +64,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     iputils-ping \
     isc-dhcp-common \
     isc-dhcp-client \
+    iscsiuio \
     jq \
     krb5-locales \
     less \
@@ -162,4 +163,4 @@ RUN systemctl enable systemd-networkd
 RUN systemctl enable ssh
 
 # Fixup sudo perms
-RUN chown root:root /usr/bin/sudo && chmod 4755 /usr/bin/sudo
\ No newline at end of file
+RUN chown root:root /usr/bin/sudo && chmod 4755 /usr/bin/sudo
diff --git a/images/Dockerfile.kairos-debian b/images/Dockerfile.kairos-debian
index 60c85c1d..3391363c 100644
--- a/images/Dockerfile.kairos-debian
+++ b/images/Dockerfile.kairos-debian
@@ -63,6 +63,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     iputils-ping \
     isc-dhcp-common \
     isc-dhcp-client \
+    iscsiuio \
     jq \
     krb5-locales \
     less \
  • I built a container image:
earthly +base-image --VARIANT=core --FLAVOR=debian --FLAVOR_RELEASE=bookworm-slim  --BASE_IMAGE=debian:bookworm-slim  --MODEL=generic --FAMILY=debian
  • I started Auroraboot with the result image:
docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock --net host quay.io/kairos/auroraboot --set "container_image=docker://quay.io/kairos/debian:bookworm-slim-core-amd64-generic-v3.0.4-73-g8ddb9092-dirty"
  • I started a VM in netboot mode (with virt-manager)

It successfully boots debian.

Since the docker command I used to run Auroraboot didn't mount any volumes other than the Docker socket, no data can have been cached between runs. @tyzbit how are you running Auroraboot? @athnoc-dev's suggestion makes me think that some people might be using a command (from our docs?) that mounts a volume and caches things. Is that the case?

jimmykarily avatar May 29 '24 11:05 jimmykarily

This is true in my case - I use auroraboot to generate ISOs to upload to my Kairos nodes, and as a result I have a mount so that I can access the completed ISO. I don't think it should be expected behavior for auroraboot to not generate a new kernel if there's an existing one present - but I'm also not sure if reusing the same directory for building has any effect on the speed of the builds themselves either.

I'm actually not too sure if this is a kernel issue, because as far as I remember this issue occurs with a fresh auroraboot install. However, another thing that appears to be common among everyone who has the issue is that the Kairos Dockerfile is modified. Is it possible that GitHub Actions caching of the Docker build steps leads to this issue?

6ixfalls avatar May 30 '24 07:05 6ixfalls

I tried provisioning from AuroraBoot with ephemeral storage (so nothing could have been cached) and I ran into the same issue.

tyzbit avatar Jun 05 '24 02:06 tyzbit

I have tested this on Debian on v3.1.0 and the issue persists. I'm not able to do any upgrades of Kairos or k3s/k8s as a result.

tyzbit avatar Jul 15 '24 23:07 tyzbit

Same problem here with this command:

docker run --rm -ti -v /tmp/temp-rootfs:/tmp/temp-rootfs -v "$PWD"/config.yaml:/config.yaml --net host quay.io/kairos/auroraboot \
  --set "container_image=quay.io/kairos/debian:bookworm-standard-amd64-generic-v3.1.1-k3sv1.28.9-k3s1" --cloud-config /config.yaml

[    OK ] Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
[    OK ] Finished systemd-tmpfiles-setup.service - Create Volatile Files and Directories.
[7.842662] dracut: FATAL: iscsiroot requested but kernel/initrd does not support iscsi
[7.843471] dracut: Refusing to continue
[7.875072] systemd-shutdown[1]: Syncing filesystems and block devices.
[7.876832] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[7.879261] systemd-journald[210]: Received SIGTERM from PID 1 (systemd-shutdow).
[7.880814] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[7.883059] systemd-shutdown[1]: Unmounting file systems.
[7.884469] (sd-umount)[349]: Unmounting '/run/credentials/systemd-tmpfiles-setup.service'.
[7.886024] (sd-umount)[350]: Unmounting '/run/credentials/systemd-tmpfiles-setup-dev.service'.
[7.887561] (sd-umount)[351]: Unmounting '/run/credentials/systemd-sysctl.service'.
[7.889076] (sd-umount)[352]: Unmounting '/run/credentials/systemd-sysusers.service'.
[7.890581] (sd-remount)[353]: Remounting '/' read-only with options 'ro'.
[7.891765] systemd-shutdown[1]: All filesystems unmounted.
[7.892697] systemd-shutdown[1]: Deactivating swaps.
[7.893512] systemd-shutdown[1]: All swaps deactivated.
[7.894312] systemd-shutdown[1]: Detaching loop devices.
[7.895201] systemd-shutdown[1]: All loop devices detached.
[7.896000] systemd-shutdown[1]: Stopping MD devices.
[7.896944] systemd-shutdown[1]: All MD devices stopped.
[7.897726] systemd-shutdown[1]: Detaching DM devices.
[7.898556] systemd-shutdown[1]: All DM devices detached.
[7.899336] systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[7.901093] systemd-shutdown[1]: Syncing filesystems and block devices.
[7.901890] systemd-shutdown[1]: Halting system.
[7.941757] reboot: System halted
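
When trying to reproduce this repeatedly, a saved console log could be checked for the failure line above instead of reading it by eye. A small sketch, assuming the console output has been captured to a file (how it is captured is not covered here, and the `boot_log_has_iscsi_failure` helper is hypothetical):

```shell
#!/usr/bin/env bash
# Succeeds if the given console log contains the dracut iscsi failure
# reported in this thread.
boot_log_has_iscsi_failure() {
    local log_file="$1"
    # -F matches the message as a fixed string rather than a regular expression.
    grep -qF 'dracut: FATAL: iscsiroot requested but kernel/initrd does not support iscsi' "$log_file"
}
```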

Same for the ubuntu:22.04 image:

docker run --rm -ti -v /tmp/temp-rootfs:/tmp/temp-rootfs -v "$PWD"/config.yaml:/config.yaml --net host quay.io/kairos/auroraboot \
  --set "container_image=quay.io/kairos/ubuntu:22.04-standard-amd64-generic-v3.0.14-k3sv1.28.5-k3s1" --cloud-config /config.yaml

Thanks for your help.

chris2k20 avatar Jul 24 '24 11:07 chris2k20

I took your Dockerfile and removed the parts that don't work for me:

FROM quay.io/kairos/debian:bookworm-standard-amd64-generic-v3.1.1-k3sv1.30.2-k3s1

#COPY rootfs/ /

RUN apt-get update && \
    apt-get install -y \
    bc=1.07.* \
    bluetooth=5.66-* \
    dbus-broker=33-* \
    # for i915
    fancontrol=1:3* \
    htop=3.2.2* \
    iotop=0.6-* \
    nethogs=0.8.7* \
    iscsiuio=2.1.8-1 \
    smartmontools=7.3-* \
    usbutils=1:014-* \
    wget=1.21.* \
    && \
    apt-get clean && rm -rf /var/lib/apt/lists/* && \
    echo "TYZBIT_HOME_URL=https://github.com/tyzbit/kairos-distros" >> /etc/os-release && \
    echo "TYZBIT_VARIANT=debian" >> /etc/os-release && \
    systemctl enable dbus-broker.service

# Update kernel modules
RUN kernel=$(ls /lib/modules | head -n1) && \
    depmod -a "${kernel}" && \
    dracut -f "/boot/initrd-${kernel}" "${kernel}" && \
    ln -sf "initrd-${kernel}" /boot/initrd

(I removed the firmware packages that weren't available and the COPY rootfs/ / part, because I don't have that directory and I don't know what it contains. Maybe repository configuration for the missing packages?)
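
One detail in the Dockerfile above that may be worth ruling out: `ls /lib/modules | head -n1` takes the lexicographically first entry, so if an image ever contained more than one kernel directory it could pick an older one (for example 6.1.0-18 before 6.1.0-21). Nobody in this thread has confirmed that this is happening; the following is only a sketch of a version-aware selection, with a hypothetical `newest_kernel` helper:

```shell
#!/usr/bin/env bash
# Print the highest kernel version directory name under modules_root.
newest_kernel() {
    local modules_root="${1:-/lib/modules}"
    # sort -V (GNU coreutils) compares embedded numbers numerically,
    # so 6.1.0-9 sorts before 6.1.0-18, unlike a plain lexical sort.
    ls -1 "$modules_root" | sort -V | tail -n1
}
```

In the Dockerfile this would replace the `kernel=$(ls /lib/modules | head -n1)` assignment with an equivalent `sort -V | tail -n1` pipeline.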

I built an image out of it with: docker build -t tyzbit-debian-image . and I started auroraboot with:

docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock --net host quay.io/kairos/auroraboot --set "container_image=docker://tyzbit-debian-image"

Then I started qemu, configured to boot from network first. This is the output of the auroraboot container:

8:04AM INF Pulling container image 'tyzbit-debian-image' to '/tmp/temp-rootfs' (local: true)
8:04AM INF Generating iso 'kairos' from '/tmp/temp-rootfs' to '/tmp/build'
8:06AM INF Extracting netboot artifacts 'kairos' from '/tmp/build/kairos.iso' to '/tmp/netboot'
8:06AM INF Listening on :8080...
8:06AM INF Start pixiecore
2024/09/05 08:12:29 DHCP: Offering to boot 52:54:00:0b:1e:22
2024/09/05 08:12:31 DHCP: Offering to boot 52:54:00:0b:1e:22
2024/09/05 08:12:32 TFTP: Sent "52:54:00:0b:1e:22/4" to 192.168.122.163:27833
2024/09/05 08:12:33 DHCP: Offering to boot 52:54:00:0b:1e:22
2024/09/05 08:12:33 HTTP: Sending ipxe boot script to 192.168.122.163:1025
2024/09/05 08:12:33 HTTP: Sent file "kernel" to 192.168.122.163:1025
2024/09/05 08:12:33 HTTP: Sent file "initrd-0" to 192.168.122.163:1025
2024/09/05 08:12:43 HTTP: Sent file "other-0" to 192.168.122.163:50414
2024/09/05 08:12:57 HTTP: Sent file "other-1" to 192.168.122.163:46738

and the VM boots just fine :shrug: .

Maybe what you have in the rootfs directory is making it fail for you somehow?

jimmykarily avatar Sep 05 '24 08:09 jimmykarily

I commented out the iscsiuio=2.1.8-1 \ line in the Dockerfile to make sure it's the actual fix. It turns out the VM boots fine even without it. I can't recall if we ever managed to reproduce the issue and, looking at the comments above, we only seem to have tried with the possible "fix", not without it.

So we are one step backwards now: we need to reproduce the issue before we can say we have a fix. The question is how? Are there any specific steps that we can take to make this happen locally using qemu?

jimmykarily avatar Sep 05 '24 08:09 jimmykarily