
Workload cluster of Dell R230s with idrac8 stuck at boot prompt after pxe booting

Open fdawg4l opened this issue 1 year ago • 3 comments

Hi,

Whenever I reboot a node which is part of a sidero workload cluster, it pxe boots and gets stuck at the boot prompt. I have to manually log into the iDRAC, connect to the console, and force it to boot from disk for the node to come back up.

It's possible I've configured something incorrectly, but

  • pxe booting works
  • I pxe booted these nodes to create/join a workload cluster
  • ipmi seems to work because after accepting the nodes and creating the cluster in the management cluster, they powered up and installed talos
  • and talosctl shutdown seems to do the right thing on these hosts.

I suspect this is a config error in whatever the management cluster uses to toggle the boot order via IPMI on the workload cluster nodes.
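For anyone who wants to inspect or override the boot device by hand while debugging, here is a minimal sketch that only builds `ipmitool` command lines. The helper names, host, user, and password are hypothetical placeholders, and defaulting to `efiboot` is an assumption about these nodes. This is not how Sidero itself drives IPMI; it just uses the standard `chassis bootparam get 5` and `chassis bootdev` subcommands.

```python
import shlex


def ipmi_base(host, user, password):
    """Common ipmitool arguments for reaching a remote BMC over lanplus."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password]


def get_boot_flags_cmd(host, user, password):
    """Command that prints boot parameter 5 (the boot flags)."""
    return ipmi_base(host, user, password) + ["chassis", "bootparam", "get", "5"]


def set_boot_disk_cmd(host, user, password, persistent=False, efi=True):
    """Command that asks the BMC to boot from disk on the next boot."""
    cmd = ipmi_base(host, user, password) + ["chassis", "bootdev", "disk"]
    options = []
    if persistent:
        options.append("persistent")  # keep the setting across reboots
    if efi:
        options.append("efiboot")     # assumption: the node boots in UEFI mode
    if options:
        cmd.append("options=" + ",".join(options))
    return cmd


# Placeholder credentials; nothing is executed, the commands are only printed.
for command in (
    get_boot_flags_cmd("idrac.example", "root", "secret"),
    set_boot_disk_cmd("idrac.example", "root", "secret", persistent=True),
):
    print(shlex.join(command))
```

Comparing the boot flags before and after a reboot would show whether the BMC is still set to PXE at the point where the node gets stuck.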

Happy to provide logs, just let me know which are interesting.

Thanks!

[screenshot: tftp stuck]

fdawg4l avatar Oct 08 '23 03:10 fdawg4l

BTW, choosing Continue simply gets us back into the PXE boot. I have to Continue, wait for the Broadcom PXE firmware loading message, cancel that by pressing Escape, and then the boot order continues and boots off the primary disk into GRUB.

fdawg4l avatar Oct 08 '23 03:10 fdawg4l

It's not clear what the problem is, but we recommend using snp.efi instead of ipxe.efi: https://www.sidero.dev/v0.6/getting-started/prereq-dhcp/

smira avatar Oct 09 '23 10:10 smira

Hi, I have a similar problem, so maybe we can resolve both here. I hit the same issue before; if you switch to snp.efi you should get further.

My dnsmasq config:

      hostNetwork: true
      containers:
        - name: dnsmasq
          args:
            - -d
            - --port=5353
            - --dhcp-range=10.202.53.20,10.202.53.100
            - --dhcp-option=option:router,10.202.53.1
            - --dhcp-option=6,1.1.1.1
            - --dhcp-boot=tag:ipxe,ipxe.efi,10.202.53.11
            - --addn-hosts=/dnsmasq/hosts.text
            - --dhcp-hostsfile=/dnsmasq/dhcphosts.txt
            - --log-queries
            - --log-dhcp
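To see at a glance which boot file these dnsmasq arguments hand out, here is a small sketch that pulls the filename out of each `--dhcp-boot` argument. It assumes the `[tag:<tag>,]<filename>[,...]` form of the option; the function name is hypothetical, and whatever follows the filename is kept uninterpreted.

```python
def parse_dhcp_boot(arg):
    """Split a --dhcp-boot argument into tags, filename, and remaining fields."""
    prefix = "--dhcp-boot="
    if not arg.startswith(prefix):
        return None
    parts = arg[len(prefix):].split(",")
    tags = []
    # Leading "tag:<name>" entries restrict the option to matching clients.
    while parts and parts[0].startswith("tag:"):
        tags.append(parts.pop(0)[len("tag:"):])
    filename = parts.pop(0) if parts else None
    return {"tags": tags, "filename": filename, "rest": parts}


ARGS = [
    "-d",
    "--port=5353",
    "--dhcp-range=10.202.53.20,10.202.53.100",
    "--dhcp-boot=tag:ipxe,ipxe.efi,10.202.53.11",
]

for arg in ARGS:
    parsed = parse_dhcp_boot(arg)
    if parsed:
        print(parsed["tags"], parsed["filename"], parsed["rest"])
```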

I'm running the DHCP proxy, and it sets up the boot.


2023/11/14 13:53:55 HTTP GET /boot.ipxe 10.202.53.11:7788
2023-11-14T13:54:04Z	INFO	dhcp-proxy	offering boot response	{"source": "04:32:01:47:34:e0", "server": "10.202.53.11", "boot_filename": "snp.efi"}
2023-11-14T13:54:04Z	INFO	dhcp-proxy	ignoring packet	{"source": "04:32:01:47:34:e0", "reason": "packet is REQUEST, not DISCOVER"}
2023/11/14 13:54:05 HTTP GET /boot.ipxe 10.202.53.11:7788
2023/11/14 13:54:05 HTTP GET /boot.ipxe 10.202.53.11:47833
2023/11/14 13:54:08 HTTP GET /ipxe?uuid=4c4c4544-0038-4810-804e-c4c04f353034&mac=04-32-01-47-34-e0&domain=&hostname=node8&serial=D8HN504&arch=x86_64 10.202.53.58:48382
2023/11/14 13:54:08 Using "agent-amd64" environment
2023/11/14 13:54:08 HTTP GET /env/agent-amd64/vmlinuz 10.202.53.58:48382
2023/11/14 13:54:09 HTTP GET /env/agent-amd64/initramfs.xz 10.202.53.58:48382
2023/11/14 13:54:15 HTTP GET /boot.ipxe 10.202.53.11:47833

From the logs we can see that the proxy switches the boot file to the new snp.efi.
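A minimal sketch for extracting that information from a longer log, assuming the dhcp-proxy lines are tab-separated with a trailing JSON object as they appear above. The function names are hypothetical, and the HTTP access-log lines are simply skipped.

```python
import json


def parse_dhcp_proxy_line(line):
    """Return the fields of a dhcp-proxy log line, or None for other lines."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 4 or parts[2] != "dhcp-proxy":
        return None
    timestamp, level, _component, message = parts[:4]
    fields = json.loads(parts[4]) if len(parts) > 4 else {}
    return {"time": timestamp, "level": level, "message": message, "fields": fields}


def offered_boot_files(lines):
    """List (client MAC, boot filename) pairs for every boot offer in the log."""
    offers = []
    for line in lines:
        record = parse_dhcp_proxy_line(line)
        if record and record["message"] == "offering boot response":
            fields = record["fields"]
            offers.append((fields.get("source"), fields.get("boot_filename")))
    return offers


SAMPLE = [
    "2023/11/14 13:53:55 HTTP GET /boot.ipxe 10.202.53.11:7788",
    '2023-11-14T13:54:04Z\tINFO\tdhcp-proxy\toffering boot response\t'
    '{"source": "04:32:01:47:34:e0", "server": "10.202.53.11", "boot_filename": "snp.efi"}',
    '2023-11-14T13:54:04Z\tINFO\tdhcp-proxy\tignoring packet\t'
    '{"source": "04:32:01:47:34:e0", "reason": "packet is REQUEST, not DISCOVER"}',
]

print(offered_boot_files(SAMPLE))
```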

Still, my boot gets stuck at an "efi stub measured data into pcr 9" message. I'm using a 10G card; next time I'm in the datacenter I will try the 1G card.

What in the world can "efi stub measured data into pcr 9" be?

(I'm also booting my VMs; they use undionly.kpxe and work fine.)

mattiashem avatar Nov 14 '23 14:11 mattiashem