
talos.config.inline not working in Omni-generated VMware OVA image

Open sedh-sbab opened this issue 1 year ago • 11 comments

Bug Report

When I generate a Talos v1.8.1 image for the VMware platform using omnictl, our CA certificate config is ignored when passed via the talos.config.inline kernel argument, and we get x509 certificate errors in the console log of the Talos VM when it tries to connect to our Omni instance.

I pass --extra-kernel-args talos.config.inline=${TALOS_CONFIG_INLINE} to omnictl, where $TALOS_CONFIG_INLINE is created by following the guide at https://www.talos.dev/v1.8/reference/kernel/#talosconfiginline. The config document is a CA certificate, see https://www.talos.dev/v1.8/talos-guides/configuration/certificate-authorities/#appending-the-certificate-authority. I have tried the official factory.talos.dev with the same result. I have checked the GRUB menu, and the talos.config.inline key and value are present.

If I instead provide the same CA certificate config document as a base64-encoded string through VMware guestinfo, the CA certificate works great and the node can connect to our Omni instance without any errors. I use this command to insert the config document into the VM:

govc vm.change \
  -e "guestinfo.talos.config=$(cat ca-root-config.yml | base64)"
....

I have tried wiping and resetting the machine, and editing the kernel arguments to change the platform and remove the talos.config=guestinfo argument, without any luck. But I am not sure that has anything to do with this.

Platform: VMware (OVA template)
Talos Version: v1.8.1

sedh-sbab avatar Oct 15 '24 15:10 sedh-sbab

Please provide kernel logs.

P.S. It's way better to use userdata than talos.config.inline with Omni.

smira avatar Oct 15 '24 15:10 smira

The kernel log: the best I can do is an image, hope that works. [image] The rest of the logs are mostly from time.syncController, which can't connect out to the internet.

The status of the node stays like this forever: [image]

P.S. For the userdata part, that actually sounds very reasonable, since it doesn't have quite the same limitations. Thank you.

sedh-sbab avatar Oct 15 '24 16:10 sedh-sbab

We need full kernel logs (serial console logs) to understand why the config failed to load. We can't debug much without them, sorry.

smira avatar Oct 15 '24 16:10 smira

> We need full kernel logs (serial console logs) to understand why the config failed to load. We can't debug much without them, sorry.

I'll check if I can attach a serial port and save the output to disk.

sedh-sbab avatar Oct 15 '24 16:10 sedh-sbab

Here it is! I have redacted the sensitive information. console-log.txt

On lines 15 and 104, talos.config.inline is clearly missing from the kernel command line. I can see it in the GRUB menu though.
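For anyone repeating this check on their own serial log, here is a minimal sketch. It assumes the log contains the usual "Kernel command line: ..." lines that Linux prints at boot; `kernel_arg_status` is a hypothetical helper name, not part of Talos or omnictl.

```shell
# Hypothetical helper: report whether kernel argument key $2 appears on any
# "command line:" line of the log file $1.
kernel_arg_status() {
  awk -v key="$2" '
    tolower($0) ~ /command line:/ {
      # An argument matches only if it starts with "<key>=".
      for (i = 1; i <= NF; i++)
        if (index($i, key "=") == 1) found = 1
    }
    END { print (found ? "present" : "missing") }
  ' "$1"
}
```

Usage would be something like `kernel_arg_status console-log.txt talos.config.inline`, which prints either `present` or `missing`.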

sedh-sbab avatar Oct 15 '24 16:10 sedh-sbab

If you're booting from the OVA, it should be there, unless there was something else happening (like an upgrade) which would wipe that kernel argument?

smira avatar Oct 15 '24 17:10 smira

It's very strange, I have to do some more digging. But no upgrade or other adjustments are made, and the kernel arguments are clearly visible in the GRUB edit menu. The steps are:

  1. Generate with omnictl
  2. Upload to our content directory with govc
  3. Deploy it. (I make no adjustments or modifications in this step, simply New VM from template)
  4. Start

These console logs are of a completely fresh machine I created.

Here is the full omnictl command with expanded variables:

omnictl download vmware \
      --talos-version v1.8.1 \
      --arch amd64 \
      --extensions vmtoolsd-guest-agent \
      --initial-labels environment=<env> --initial-labels region=<REGION> \
      --extra-kernel-args talos.config.inline=$(cat sbab-root-ca.yml | zstd --compress --ultra -22 | base64 -w 0) \
      --output _out/v1.8.1-<REGION>-common

GRUB image: [image]

sedh-sbab avatar Oct 15 '24 17:10 sedh-sbab

I wonder if it's too big and gets cut by GRUB... maybe your certificate is RSA? ECDSA is way smaller

smira avatar Oct 15 '24 18:10 smira

Well, in total, with our talos.config.inline, the whole command line is 2700 bytes.

BOOT_IMAGE=/A/vmlinuz talos.platform=vmware talos.config=guestinfo console=tty0 console=ttyS0 earlyprintk=ttyS0,115200 net.ifnames=0 init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 siderolink.api=https://<REDACTED>:443?grpc_tunnel=false&jointoken=<REDACTED> talos.events.sink=[fdae:41e4:649b:9303::1]:8091 talos.logging.kernel=tcp://[fdae:41e4:649b:9303::1]:8092 talos.config.line=<2213 bytes>

This is without the redacted stuff.

❯ wc -c talos-kernel-args.txt
2700 talos-kernel-args.txt

Your documentation says the Linux kernel args have a max size of 4096, but maybe GRUB has another limit?
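The size check above can be sketched as a pair of small shell helpers. This is only an illustration: `cmdline_headroom` and `arg_sizes` are hypothetical names, and the 4096 default is the kernel figure from the documentation, not a known GRUB or boot-protocol limit.

```shell
# Hypothetical helper: bytes left under a limit for the command line in file $1.
# $2 is the limit; 4096 is the kernel figure quoted above, not a known GRUB limit.
cmdline_headroom() {
  limit="${2:-4096}"
  # Strip the trailing newline so it is not counted.
  len=$(tr -d '\n' < "$1" | wc -c | tr -d ' ')
  echo $(( limit - len ))
}

# Hypothetical helper: list each argument's length and key, largest first.
# Lengths are in characters, which equals bytes for an ASCII command line.
arg_sizes() {
  tr -s ' ' '\n' < "$1" | awk -F= 'NF { print length($0), $1 }' | sort -rn
}
```

With the 2700-byte file from above, `cmdline_headroom talos-kernel-args.txt` would report roughly 1396 bytes of headroom against the kernel limit, and `arg_sizes talos-kernel-args.txt` would show which argument dominates. A negative number from `cmdline_headroom` means the command line exceeds the limit being tested.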

sedh-sbab avatar Oct 16 '24 06:10 sedh-sbab

Yes, it might be a limit in GRUB or in the boot protocol used with GRUB (I guess you're booting in BIOS mode on VMware?)

smira avatar Oct 16 '24 09:10 smira

> Yes, it might be a limit in GRUB or in the boot protocol used with GRUB (I guess you're booting in BIOS mode on VMware?)

Yes, BIOS.

sedh-sbab avatar Oct 16 '24 10:10 sedh-sbab

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Apr 15 '25 02:04 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar Apr 20 '25 02:04 github-actions[bot]