1.12 machineconfig patch issues

Open nicolerenee opened this issue 1 month ago • 9 comments

Bug Report

Description

I'm unable to apply machineconfig patches to any of my v1.12 nodes. I'm running a build derived from all the tooling in v1.12.0-alpha.2-6-g64a46a7. I realize it's an alpha build, but neither patching the machine config nor applying an edited config works.

Logs

talosctl -n c0r1-gpu1 patch machineconfig --patch @talos-kubelet-patch.yaml
recovered: expected a mapping node

The patch is simple:

machine:
  kubelet:
    extraArgs:
      max-pods: "60"

Also fails. The patch worked just fine against my v1.11.3 nodes, which are provisioned with a more or less identical machineconfig.

I have even tried downloading the existing machineconfig using -o yaml, editing the file, and applying it, and I get the same error.

Looking at the logs from apid there isn't anything more helpful that I can see:

c0r1-gpu1: 2025/11/12 19:39:22.946083 log.go:94: InvalidArgument [/machine.MachineService/ApplyConfiguration] 47.284866ms stream rpc error: code = InvalidArgument desc = recovered: expected a mapping node (:authority=c0r1-gpu1:50000;content-type=application/grpc+proxy>proto;grpc-accept-encoding=gzip,gzip;proxyfrom=10.95.186.6;runtime=Talos;talos-role=os:admin;user-agent=grpc-go/1.68.1)

being the only relevant log.

I have also tested using the alpha talosctl and get the same error. Happy to provide any additional logs, just let me know.

For comparison, the same patch applies cleanly to a v1.11.3 node:

❯ talosctl -n c1t1-gpu2 patch machineconfig --patch @talos-kubelet-patch.yaml
WARNING: c1t1-gpu2: server version 1.11.3-0 is older than client version 1.11.5
patched MachineConfigs.config.talos.dev/v1alpha1 at the node c1t1-gpu2
WARNING: extra kernel arguments are not supported when booting using SDBoot
Applied configuration without a reboot

Environment

Bare metal nodes

  • Talos version:
❯ talosctl version --nodes c0r1-gpu1
Client:
	Tag:         v1.11.5
	SHA:         undefined
	Built:       2025-11-06T12:35:51Z
	Go version:  go1.25.4
	OS/Arch:     darwin/arm64
Server:
	NODE:        c0r1-gpu1
	Tag:         v1.12.0-alpha.2-16-gc93a9c6b4
	SHA:         c93a9c6b
	Built:
	Go version:  go1.25.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC
  • Kubernetes version:
Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.32.0
  • Platform:

nicolerenee avatar Nov 13 '25 00:11 nicolerenee

I can't reproduce this issue, with main at least. It looks like you have some broken build (?). Your exact patch applies, and all the functionality around config patching is fully tested.

smira avatar Nov 13 '25 10:11 smira

This is the command I used to build it:

docker run --rm -t -v $PWD/_out:/secureboot:ro -v $PWD/_out:/out -v /dev:/dev --privileged ghcr.io/siderolabs/imager:v1.12.0-alpha.2-16-gc93a9c6b4 \
installer \
--system-extension-image ghcr.io/siderolabs/amd-ucode:20251021@sha256:e64dbc49897ddfdb7ab694b446a25488cdfb7d145f026d63f57026e71593a67c \
--system-extension-image ghcr.io/siderolabs/iscsi-tools:v0.2.0@sha256:e49cc872c25853ed27bce6b5c3ffee281d205c09f03db20f236409581c5f8cb9 \
--system-extension-image ghcr.io/siderolabs/lldpd:1.0.20@sha256:a7c63d6d0e4f6e0452d6b44b9fd989df73e6b240bd2aba37ee309f83cd80fcde \
--system-extension-image ghcr.io/siderolabs/nvidia-container-toolkit-production:570.195.03-v1.18.0@sha256:9e5e63220f9712f6618b52efcd1c88ce7345cb04ca9f6adb0679bd530fd8587a \
--system-extension-image ghcr.io/siderolabs/nvidia-fabricmanager-production:570.195.03@sha256:0a4fc05b9bf1a350b006fcc4534097666bbd4f0842048f8fd87cc1e3b2bb9c2f \
--system-extension-image ghcr.io/nicolerenee/talos/nvidia-gdrdrv-mount:v2.5.1@sha256:18953a855df9d8108b21fe74bef6eec9a6d8077dcc59c2aeef617e5b88d4eebd \
--system-extension-image ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules-production:570.195.03-v1.12.0-alpha.2-6-g64a46a7@sha256:b64d6fc844164037017b05e6428fde80f53510966c701c42106f46cd8106b030 \
--base-installer-image ghcr.io/siderolabs/installer-base:v1.12.0-alpha.2-16-gc93a9c6b4 \
--extra-kernel-arg "iommu=pt"

My nvidia-gdrdrv-mount is the exact same as what got merged into extensions as nvidia-gdrdrv-device. Now that it's merged I can build this again with only extensions y'all auto published, but wanted to double check you weren't seeing anything in this command that is wrong that could be causing the problem.

My other thought is that the machineconfig we are applying during install works but somehow gets the node into a broken state.

nicolerenee avatar Nov 13 '25 15:11 nicolerenee

I don't have any exact idea here, sorry. If it doesn't work with a Talos release, we're happy to look into it.

smira avatar Nov 13 '25 15:11 smira

For me, disabling the UEFI configuration on the host solved the problem! No more warning messages.

pedarthurc avatar Nov 14 '25 13:11 pedarthurc

Unfortunately disabling UEFI isn't an option for me.

I have upgraded my nodes to the v1.12.0-beta.0 release with an image built by factory and I'm still getting the same error.

❯ talosctl -n c0r3-gpu1 patch machineconfig --patch @talos-kubelet-patch.yaml
error constructing client: failed to determine endpoints

❯ talosctl version -n c0r3-gpu1
Client:
	Tag:         v1.11.5
	SHA:         undefined
	Built:       2025-11-06T12:35:51Z
	Go version:  go1.25.4
	OS/Arch:     darwin/arm64
Server:
	NODE:        c0r3-gpu1
	Tag:         v1.12.0-beta.0
	SHA:         3d997d74
	Built:
	Go version:  go1.25.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC

❯ k describe no c0r3-gpu1 | grep schematic:
                    extensions.talos.dev/schematic: c1272823a3d5aecf17257649487664aa48397f8cae15b573b0a14165e2c790cf

nicolerenee avatar Nov 18 '25 21:11 nicolerenee

This has nothing to do with UEFI.

> ❯ talosctl -n c0r3-gpu1 patch machineconfig --patch @talos-kubelet-patch.yaml
> error constructing client: failed to determine endpoints

This means endpoints have not been set in TALOSCONFIG. Use --endpoints and --nodes to target a specific node (when targeting a worker node, --endpoints should be a controlplane).
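As a rough sketch of that advice, the helper below only prints the command it would run, so it can be reviewed first. The control plane address 10.0.0.10 is a made-up placeholder and build_patch_cmd is a hypothetical helper name; the node and patch file are the ones from this thread. The flag is spelled --endpoints in talosctl.

```shell
# build_patch_cmd ENDPOINT NODE PATCHFILE
# Prints (does not run) a patch command with an explicit endpoint and node.
# ENDPOINT should be a control plane address when NODE is a worker.
build_patch_cmd() {
  printf 'talosctl --endpoints %s --nodes %s patch machineconfig --patch @%s\n' \
    "$1" "$2" "$3"
}

# 10.0.0.10 is a placeholder, not an address from this cluster.
build_patch_cmd 10.0.0.10 c0r3-gpu1 talos-kubelet-patch.yaml
```

Alternatively, the endpoint can be stored once in the talosconfig with talosctl config endpoint, after which -n alone is enough.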

frezbo avatar Nov 19 '25 03:11 frezbo

You are correct, sorry. I saw the error and just copy-pasted it without reading closely enough. I pointed to the correct talosconfig for this cluster and I'm still getting the error.

❯ talosctl -n c0r3-gpu1 patch machineconfig --patch @talos-kubelet-patch.yaml
recovered: expected a mapping node

nicolerenee avatar Nov 19 '25 06:11 nicolerenee

I think an easy reproducer would be: run talosctl gen secrets, generate a machineconfig with --with-secrets, and apply the patch, so we can try to reproduce. Please also include the talosctl version, the server version, and the patch itself.
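A minimal sketch of such a reproducer, under some assumptions: the cluster name repro-cluster and the endpoint https://10.0.0.10:6443 are placeholders, and the last talosctl step only checks the patch offline on the client side. It does not exercise the server-side ApplyConfiguration path where the error in this report comes from, so a test node is still needed for the final step.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# The kubelet patch from the report.
cat > talos-kubelet-patch.yaml <<'EOF'
machine:
  kubelet:
    extraArgs:
      max-pods: "60"
EOF

if command -v talosctl >/dev/null 2>&1; then
  # Fresh secrets and a fresh config generated from them.
  talosctl gen secrets -o secrets.yaml
  talosctl gen config --with-secrets secrets.yaml repro-cluster https://10.0.0.10:6443
  # Offline check: does the patch merge into the generated config?
  talosctl machineconfig patch controlplane.yaml \
    --patch @talos-kubelet-patch.yaml -o patched.yaml
  # Client version to include in the report.
  talosctl version --client
else
  echo "talosctl not found; only the patch file was written to $workdir"
fi
```

From there, applying controlplane.yaml to a test node and then running talosctl patch machineconfig against it would cover the server side.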

frezbo avatar Nov 19 '25 06:11 frezbo

Applying a machine config with patch works for me with 1.12-beta.0

./talosctl apply -f controlplane.yaml  -n 10.1.1.14 -i -p '@nvidia.yaml'

And patching the machine config worked

./talosctl patch mc --patch '@cp-schedule.yaml' -n 10.1.1.14

The contents of cp-schedule.yaml are

cluster:
  allowSchedulingOnControlPlanes: true

But some patches that traditionally succeed failed.

Using the old hostname config doesn't work:

machine:
  network:
    hostname: spark

and applying it

./talosctl patch machineconfig -p '@hostname2.yaml' -n 10.1.1.14
patched MachineConfigs.config.talos.dev/v1alpha1 at the node 10.1.1.14
1 error occurred:
        * 10.1.1.14: rpc error: code = InvalidArgument desc = 1 error occurred:
        * static hostname is already set in v1alpha1 config

If I try to apply the new multi-doc config

apiVersion: v1alpha1
kind: HostnameConfig
hostname: spark

I get the following error

./talosctl apply -f hostname.yaml -n 10.1.1.14
error applying new configuration: 1 error occurred:
        * 10.1.1.14: rpc error: code = InvalidArgument desc = the applied machine configuration doesn't contain v1alpha1 config, did you mean to patch the machine config instead?

@nicolerenee I get the same error as you under a couple of conditions.

If I forget the @ on my patch file:

./talosctl patch machineconfig -p hostname.yaml -n 10.1.1.14 
recovered: expected a mapping node
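As far as I understand it, without the @ the argument is taken as the patch content itself rather than a file name, so a bare hostname.yaml is parsed as a plain YAML string instead of a mapping. A small guard like the one below can rule that out; patch_arg is a hypothetical helper name, not part of talosctl.

```shell
# patch_arg FILE
# Prints a --patch value that makes talosctl read the patch from FILE.
# Accepts the name with or without a leading @, and fails if the file is missing.
patch_arg() {
  file="${1#@}"   # strip one leading @, if present
  if [ ! -f "$file" ]; then
    echo "patch file not found: $file" >&2
    return 1
  fi
  printf '@%s\n' "$file"
}
```

It would be used as: ./talosctl patch machineconfig -p "$(patch_arg hostname.yaml)" -n 10.1.1.14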

I tried the same patch for the kubelet in your example and it worked on my system.

rothgar avatar Nov 19 '25 19:11 rothgar