
[BUG] 6.1 ISO Loops Over Installing Files Endlessly and Never Completes.

Open ParadoxGuitarist opened this issue 3 months ago • 10 comments

What steps did you take and what happened:

  • Install Elemental from Rancher
  • Create a MachineRegistration with config.elemental.install.snapshotter.type=loopdevice (see the sketch after this list)
  • Create Install Media with SLE Micro 6.1 v2.2.0-4.1 or 4.2 or 4.3
  • Start an install with the media
  • The installer loops endlessly copying files to the disk; it never completes and the files are always being copied.
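
For reference, a minimal sketch of where that field sits in the MachineRegistration resource (the name and namespace are placeholders; all other fields omitted):

apiVersion: elemental.cattle.io/v1beta1
kind: MachineRegistration
metadata:
  name: my-registration        # placeholder name
  namespace: fleet-default     # assumed namespace
spec:
  config:
    elemental:
      install:
        snapshotter:
          type: loopdevice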

What did you expect to happen: Installation completes and the node reboots or shuts down.

Anything else you would like to add: Here's part of the config.elemental from the MachineRegistration.

elemental:
    install:
      device: /dev/nvme0n1
      reboot: true
      snapshotter:
        type: loopdevice
    registration:
      auth: tpm
      emulate-tpm: true
      emulated-tpm-seed: -1
    reset:
      enabled: true
      reboot: true
      reset-oem: true
      reset-persistent: true

Environment:

  • Elemental Version from Helm: 1.6.8
  • Rancher version: 2.11.2
  • Kubernetes Version: v1.32.5+k3s1
  • Cloud provider or hardware configuration: Baremetal

Workaround: Change spec.config.elemental.install.snapshotter.type to btrfs (see the snippet below).
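
A minimal sketch of the changed snapshotter block (same field layout as the config above):

elemental:
  install:
    snapshotter:
      type: btrfs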

Possible alternative: Use a 6.0 ISO and upgrade afterwards.

ParadoxGuitarist avatar Sep 10 '25 16:09 ParadoxGuitarist

Does anyone need anything extra for this? A video or something?

ParadoxGuitarist avatar Sep 26 '25 21:09 ParadoxGuitarist

Hello, thanks for your report! I tried in my lab with the same versions of Rancher/K3s/emulated TPM and the bare-metal ISO image (v2.2.0-4.3), with Elemental v1.6.9 (a version with minor fixes that should not have any impact). I also paid attention to use the loopdevice snapshotter and not the default btrfs one.

Long story short: I was not able to reproduce your issue, sorry!

To try to help you, could you:

  • Try with the Elemental operator v1.6.9? I really don't expect this to have an impact, but better safe than sorry!
  • Tell me more about the machine you want to install (model, RAM, CPU, NVMe drive type and size, etc.)? That way I can get something as close to your setup as possible in my tests.

As a reference, my tests were done on a VM with this configuration:

  • RAM: 4GB
  • CPU: 1 core
  • HDD type: nvme
  • HDD size: 30GB
  • TPM: emulated through Elemental

Thanks!

ldevulder avatar Oct 15 '25 09:10 ldevulder

* Try with the Elemental operator v1.6.9? I really don't expect this to have an impact, but better safe than sorry!

I'd need to wait for a maintenance window to update production, which can't happen until the end of the term (December), but I'll see if I can stand up a dev env for this.

* Tell me more about the machine you want to install (model, RAM, CPU, NVMe drive type and size, etc.)? That way I can get something as close to your setup as possible in my tests.

We're installing onto bare metal. They're fairly large boxes with NVMe disks. I think they have a 900Gi disk for the OS and four 1.5Ti disks for Longhorn. We typically throw those larger disks into an LVM volume group, which gets mounted via the cloud-config.

As a reference, my tests were done on a VM with this configuration:

* RAM: 4GB
* CPU: 1 core
* HDD type: nvme
* HDD size: 30GB
* TPM: emulated through Elemental

The major difference is that we're only installing on bare metal. Typically our boxes are Gigabyte AMD with 128 cores (256 threads) and 512Gi of RAM, but this happens on our older Dell models as well. The newer ones have multiple NVMe disks and the older ones SSD/HDD, so we have to use a disk selector in the registration endpoint. We're also doing emulated TPM via Elemental.

Here's an only slightly redacted cloud-config section from a bugged endpoint:

config:
  cloud-config:
    runcmd:
      - mkdir -p /var/lib/longhorn
      - >-
        echo '/dev/mapper/longhorn--big-lvol0 /var/lib/longhorn   ext4   
        defaults,nofail 0 0' >> /etc/fstab
      - vgscan --mknodes -v && lvscan -v && vgchange -a y longhorn-big
      - mount /dev/mapper/longhorn--big-lvol0 /var/lib/longhorn
      - modprobe rbd
      - >-
        echo
        'PATH=/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/opt/rke2/bin'
        >> /etc/systemd/system/rancher-system-agent.env
      - systemctl daemon-reload
    users:
      - name: root
        passwd: ThisIsNotASafePassword
        ssh_authorized_keys:
          - >-
            <YourKeyHere>
  elemental:
    install:
      device: /dev/nvme0n1
      reboot: true
      snapshotter:
        type: loopdevice
    registration:
      auth: tpm
      emulate-tpm: true
      emulated-tpm-seed: -1
    reset:
      enabled: true
      reboot: true
      reset-oem: true
      reset-persistent: true
machineInventoryAnnotations: {}
machineInventoryLabels:
  rke2-role: worker
machineName: ${System Data/Runtime/Hostname}

I don't really want to post the resulting ISO here on GitHub since it has a registration key, but I'd be happy to share one if you want to reach out to me on the Rancher Users Slack space.

ParadoxGuitarist avatar Oct 15 '25 15:10 ParadoxGuitarist

I was able to replicate it on 1.6.9 in a non-prime cluster. I had to make a small network change to simplify the registration cloud config:

config:
  cloud-config:
    runcmd:
      - >-
        echo
        'PATH=/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/opt/rke2/bin'
        >> /etc/systemd/system/rancher-system-agent.env
      - systemctl daemon-reload
    users:
      - name: root
        passwd: ThisIsNotASafePassword
        ssh_authorized_keys: <redacted> 
  elemental:
    install:
      device-selector:
        - key: Name
          operator: In
          values:
            - /dev/nvme0n1
            - /dev/nvme1n1
            - /dev/nvme2n1
            - /dev/nvme3n1
            - /dev/nvme4n1
            - /dev/sda
            - /dev/vda
            - /dev/sdd
        - key: Size
          operator: Gt
          values:
            - 25Gi
        - key: Size
          operator: Lt
          values:
            - 900Gi
      reboot: true
      snapshotter:
        type: loopdevice
    registration:
      auth: tpm
      emulate-tpm: true
      emulated-tpm-seed: -1
    reset:
      enabled: true
      reboot: true
      reset-oem: true
      reset-persistent: true
machineInventoryAnnotations: {}
machineInventoryLabels:
  rke2-role: worker
machineName: ${System Data/Runtime/Hostname}

Here's a video of the loop (it'll continue forever until you reboot the box):

https://github.com/user-attachments/assets/4284beb0-ce6b-46dd-b5f1-231d1291120f

ParadoxGuitarist avatar Oct 15 '25 17:10 ParadoxGuitarist

Thanks for the information. I will try something in my lab, but in your video I can see that the OS is installed on nvme0n1 AND nvme1n1, so I assume that your server has at least 2 NVMe drives. As your device-selector contains both drives, that could be the issue. AFAIK the device-selector doesn't stop at the first match, but in that case I don't know why it works with an OS image based on SLE Micro 6.0.

Anyway, I will do more tests with the information you provided.
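
If multiple matching drives were the issue, one way to test that hypothesis would be to narrow the selector to a single device, e.g. (same selector syntax as in your config above):

device-selector:
  - key: Name
    operator: In
    values:
      - /dev/nvme0n1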

ldevulder avatar Oct 15 '25 19:10 ldevulder

I will try something in my lab, but in your video I can see that the OS is installed on nvme0n1 AND nvme1n1, so I assume that your server has at least 2 NVMe drives. As your device-selector contains both drives, that could be the issue

I don't think so? In the first example I sent, it was hardcoded in the cloud config to nvme0n1. There are 4-6 NVMe disks on these boxes and the BIOS doesn't always order them the same way.

When I pick btrfs for the snapshotter, it installs on one disk and then reboots. The video shows it picking nvme0 multiple times, then nvme1 occasionally? I think it might just be the selector rotating through in the loop, with some sort of race condition that excludes nvme0 sometimes.

Either way, hardcoding it to nvme0 reproduces the same result.

ParadoxGuitarist avatar Oct 15 '25 19:10 ParadoxGuitarist

@ParadoxGuitarist Are you able to get the journal for the elemental-register-install.service unit?
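
For example, something like this on the node should capture it (standard journalctl usage; the output filename is arbitrary):

journalctl -u elemental-register-install.service --no-pager > eri.log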

frelon avatar Oct 16 '25 07:10 frelon

@frelon

redacted-eri.log

ParadoxGuitarist avatar Oct 16 '25 15:10 ParadoxGuitarist

Looks like we found the culprit:

Oct 16 14:46:00 kube18.node elemental-register[5339]: panic: runtime error: index out of range [0] with length 0
Oct 16 14:46:00 kube18.node elemental-register[5339]: goroutine 1 [running]:
Oct 16 14:46:00 kube18.node elemental-register[5339]: github.com/rancher/elemental-toolkit/v2/pkg/snapshotter.(*LoopDevice).cleanOldSnapshots(0xc000786000)

This looks to be the same issue that was fixed in https://github.com/rancher/elemental-toolkit/pull/2293. If I remember correctly it was backported to version v2.2.4. I will check if we need to refresh our channels for the 6.1 stream!
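
For context, the panic is the classic empty-slice indexing pattern. A minimal Go sketch of the failure mode and the guard that fixes it (not the actual toolkit code; the snapshot type and function names here are illustrative only):

package main

import "fmt"

// snapshot stands in for the toolkit's snapshot metadata.
type snapshot struct{ id int }

// buggyOldest mimics the panic: an empty slice is indexed at [0],
// e.g. when there are no old snapshots to clean up yet.
func buggyOldest(snaps []snapshot) snapshot {
	return snaps[0] // panic: runtime error: index out of range [0] with length 0
}

// guardedOldest checks for the empty case before indexing.
func guardedOldest(snaps []snapshot) (snapshot, bool) {
	if len(snaps) == 0 {
		return snapshot{}, false
	}
	return snaps[0], true
}

func main() {
	if s, ok := guardedOldest(nil); ok {
		fmt.Println("cleaning snapshot", s.id)
	} else {
		fmt.Println("no old snapshots to clean")
	}
}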

frelon avatar Oct 17 '25 11:10 frelon

Also, running elemental-operator 1.7 should fix this, since it sets max-snaps to 2 as expected by the toolkit.
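
If staying on the older operator, a sketch of setting it explicitly in the registration config (assuming the field is spelled max-snaps, matching the comment above; verify against your operator version):

elemental:
  install:
    snapshotter:
      type: loopdevice
      max-snaps: 2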

frelon avatar Oct 17 '25 11:10 frelon