[BUG] 6.1 ISO Loops Over Installing Files Endlessly and Never Completes.
What steps did you take and what happened:
- Install Elemental from Rancher
- Create a Machine Registration with config.elemental.install.snapshotter.type=loopdevice
- Create Install Media with SLE Micro 6.1 v2.2.0-4.1, 4.2, or 4.3
- Start an install with the media
- The installer loops installing files to the disk endlessly. The install never completes and the files keep copying forever.
What did you expect to happen: Installation completes and the node reboots or shuts down.
Anything else you would like to add:
Here's part of the config.elemental from the machineregistration:
elemental:
  install:
    device: /dev/nvme0n1
    reboot: true
    snapshotter:
      type: loopdevice
  registration:
    auth: tpm
    emulate-tpm: true
    emulated-tpm-seed: -1
  reset:
    enabled: true
    reboot: true
    reset-oem: true
    reset-persistent: true
Environment:
- Elemental Version from Helm: 1.6.8
- Rancher version: 2.11.2
- Kubernetes Version: v1.32.5+k3s1
- Cloud provider or hardware configuration: Baremetal
Workarounds:
Change the spec.elemental.install.snapshotter.type to btrfs (sketched below).
Possible alternative: Use a 6.0 ISO and upgrade afterwards.
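For reference, a minimal sketch of that first workaround as it would appear in the MachineRegistration's config.elemental section (same fields as the config above; only the snapshotter type changes):
elemental:
  install:
    device: /dev/nvme0n1
    reboot: true
    snapshotter:
      type: btrfs   # the default; switching away from loopdevice avoids the endless install loop described here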
Does anyone need anything extra for this? A video or something?
Hello, thanks for your report! I tried in my lab with the same Rancher/K3s versions, emulated TPM, and bare-metal ISO image (v2.2.0-4.3), but with Elemental v1.6.9 (a version with minor fixes that should not have any impact). I also paid attention to use the loopdevice snapshotter and not the default btrfs one.
Long story short: I was not able to reproduce your issue, sorry!
To try to help, could you:
- Try with the Elemental operator v1.6.9? I really don't expect this to have an impact, but better safe than sorry!
- Tell me more about the machine you want to install on (model, RAM, CPU, NVMe drive type and size, etc.)? That way I can get as close to your setup as possible in my tests.
As a reference, my tests were done on a VM with this configuration:
- RAM: 4GB
- CPU: 1 core
- HDD type: nvme
- HDD size: 30GB
- TPM: emulated through Elemental
Thanks!
* Try with the Elemental operator v1.6.9? I really don't expect this to have an impact, but better safe than sorry!
I'd need to wait for a maintenance window to update production, which can't happen until the end of the Term (December) but I'll see if I can stand up a dev env for this.
* Tell me more about the machine you want to install on (model, RAM, CPU, NVMe drive type and size, etc.)? That way I can get as close to your setup as possible in my tests.
We're installing onto bare metal. They're fairly large boxes with NVMe disks. I think they have a 900Gi disk for the OS and four 1.5Ti disks for Longhorn. We typically throw those larger disks into an LVM volume group, which gets mounted via the cloud-config.
As a reference, my tests were done on a VM with this configuration:
* RAM: 4GB
* CPU: 1 core
* HDD type: nvme
* HDD size: 30GB
* TPM: emulated through Elemental
The major difference is that we're only installing on bare metal. Typically our boxes are Gigabyte AMD with 128 cores (256 threads) and 512Gi of RAM, but this happens on our older Dell models as well. They have multiple disks, NVMe on the newer ones and SSD/HDD on the older ones, so we have to use a disk selector in the registration endpoint. We're also doing emulated TPM via Elemental.
Here's an only slightly redacted cloud config section from a bugged endpoint:
config:
  cloud-config:
    runcmd:
      - mkdir -p /var/lib/longhorn
      - >-
        echo '/dev/mapper/longhorn--big-lvol0 /var/lib/longhorn ext4
        defaults,nofail 0 0' >> /etc/fstab
      - vgscan --mknodes -v && lvscan -v && vgchange -a y longhorn-big
      - mount /dev/mapper/longhorn--big-lvol0 /var/lib/longhorn
      - modprobe rbd
      - >-
        echo
        'PATH=/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/opt/rke2/bin'
        >> /etc/systemd/system/rancher-system-agent.env
      - systemctl daemon-reload
    users:
      - name: root
        passwd: ThisIsNotASafePassword
        ssh_authorized_keys:
          - >-
            <YourKeyHere>
  elemental:
    install:
      device: /dev/nvme0n1
      reboot: true
      snapshotter:
        type: loopdevice
    registration:
      auth: tpm
      emulate-tpm: true
      emulated-tpm-seed: -1
    reset:
      enabled: true
      reboot: true
      reset-oem: true
      reset-persistent: true
machineInventoryAnnotations: {}
machineInventoryLabels:
  rke2-role: worker
machineName: ${System Data/Runtime/Hostname}
I don't really want to post the resulting ISO here on GitHub since it contains a registration key, but I'd be happy to share one if you want to reach out to me on the Rancher Users Slack workspace.
I was able to replicate it on 1.6.9 in a non-prime cluster. I had to make a small network change and simplify the registration cloud config:
config:
  cloud-config:
    runcmd:
      - >-
        echo
        'PATH=/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/opt/rke2/bin'
        >> /etc/systemd/system/rancher-system-agent.env
      - systemctl daemon-reload
    users:
      - name: root
        passwd: ThisIsNotASafePassword
        ssh_authorized_keys: <redacted>
  elemental:
    install:
      device-selector:
        - key: Name
          operator: In
          values:
            - /dev/nvme0n1
            - /dev/nvme1n1
            - /dev/nvme2n1
            - /dev/nvme3n1
            - /dev/nvme4n1
            - /dev/sda
            - /dev/vda
            - /dev/sdd
        - key: Size
          operator: Gt
          values:
            - 25Gi
        - key: Size
          operator: Lt
          values:
            - 900Gi
      reboot: true
      snapshotter:
        type: loopdevice
    registration:
      auth: tpm
      emulate-tpm: true
      emulated-tpm-seed: -1
    reset:
      enabled: true
      reboot: true
      reset-oem: true
      reset-persistent: true
machineInventoryAnnotations: {}
machineInventoryLabels:
  rke2-role: worker
machineName: ${System Data/Runtime/Hostname}
Here's a video of the loop (it'll continue forever until you reboot the box):
https://github.com/user-attachments/assets/4284beb0-ce6b-46dd-b5f1-231d1291120f
Thanks for the information. I will try something in my lab, but from your video I can see that the OS is being installed on nvme0n1 AND nvme1n1, so I assume your server has at least two NVMe drives. Since your device-selector contains both drives, that could be the issue. AFAIK the device-selector does not stop at the first device found, but in that case I don't know why it works with the OS image based on SL Micro 6.0.
Anyway, I will do more tests with the information you provided.
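To illustrate what I mean (just a sketch, not a confirmed fix), a device-selector narrowed to a single drive, using the same keys as in your config, would look like this:
elemental:
  install:
    device-selector:
      - key: Name
        operator: In
        values:
          - /dev/nvme0n1   # match only one drive, to rule out multi-drive selection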
I will try something in my lab, but from your video I can see that the OS is being installed on nvme0n1 AND nvme1n1, so I assume your server has at least two NVMe drives. Since your device-selector contains both drives, that could be the issue.
I don't think so? In the first example I sent, the device was hard-coded in the cloud config to nvme0n1. There are 4-6 NVMe disks on these boxes and the BIOS doesn't always order them the same way.
When I pick btrfs for the snapshotter, it installs on one disk and then reboots. The video shows it picking nvme0 multiple times, then nvme1 occasionally? I think it might just be the selector rotating through in the loop, with some sort of race condition that excludes nvme0 sometimes.
Either way, hardcoding it to nvme0 reproduces the same result.
@ParadoxGuitarist Are you able to get the journal for the elemental-register-install.service unit?
Looks like we found the culprit:
Oct 16 14:46:00 kube18.node elemental-register[5339]: panic: runtime error: index out of range [0] with length 0
Oct 16 14:46:00 kube18.node elemental-register[5339]: goroutine 1 [running]:
Oct 16 14:46:00 kube18.node elemental-register[5339]: github.com/rancher/elemental-toolkit/v2/pkg/snapshotter.(*LoopDevice).cleanOldSnapshots(0xc000786000)
This looks to be the same issue that was fixed in https://github.com/rancher/elemental-toolkit/pull/2293. If I remember correctly it was backported to version v2.2.4. I will check if we need to refresh our channels for the 6.1 stream!
Also, running elemental-operator 1.7 should fix this, since it sets max-snaps to 2 as expected by the toolkit.
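For completeness, a sketch of what that amounts to in the install config: pinning the loopdevice snapshotter's snapshot limit explicitly. The max-snaps key name below is taken from the toolkit's config convention and is an assumption here, not verified against the MachineRegistration schema:
elemental:
  install:
    snapshotter:
      type: loopdevice
      max-snaps: 2   # assumed key name (toolkit convention); elemental-operator 1.7 sets this to 2 automatically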