Talos `1.7.2`: bare-metal provisioning succeeds only intermittently when using bonded interfaces
Bug Report
Description
I'm using Sidero Metal for provisioning. I have 2 identical Dell R340 servers. Network-wise, they have 4 network interfaces each:
eno1 - Interface UP
eno2 - Interface UP
enp3s0f0 - Interface Down
enp3s0f1 - Interface Down
My `TalosControlPlane` network config looks like this:
- op: add
  path: /machine/network/interfaces
  value:
    - dhcp: true
      mtu: 9000
      interface: bond0
      bond:
        interfaces:
          - eno1
          - eno2
        mode: 802.3ad
        lacpRate: fast
        xmitHashPolicy: layer2+3
      vip:
        ip: 172.18.0.8
      vlans:
        - vlanId: 3072
          dhcp: false
          mtu: 9000
        - vlanId: 3073
          dhcp: false
          mtu: 9000
        - vlanId: 3074
          dhcp: false
          mtu: 9000
        - vlanId: 3075
          dhcp: false
          mtu: 9000
        - vlanId: 3076
          dhcp: false
          mtu: 9000
        - vlanId: 3077
          dhcp: false
          mtu: 9000
        - vlanId: 3078
          dhcp: false
          mtu: 9000
My Juniper EX2300 core switch has `ae` interfaces configured for both servers:
set interfaces ae2 description r340-1
set interfaces ae2 vlan-tagging
set interfaces ae2 native-vlan-id 2048
set interfaces ae2 mtu 9000
set interfaces ae2 aggregated-ether-options lacp periodic fast
set interfaces ae2 aggregated-ether-options lacp force-up
set interfaces ae2 unit 0 family ethernet-switching interface-mode trunk
set interfaces ae2 unit 0 family ethernet-switching vlan members compute
set interfaces ae2 unit 0 family ethernet-switching vlan members network_mgmt
set interfaces ae2 unit 0 family ethernet-switching vlan members device0
set interfaces ae2 unit 0 family ethernet-switching vlan members device1
set interfaces ae2 unit 0 family ethernet-switching vlan members debug
set interfaces interface-range r340-1 member ge-0/0/12
set interfaces interface-range r340-1 member ge-0/0/13
set interfaces interface-range r340-1 ether-options 802.3ad ae2
set interfaces ae3 description r340-2
set interfaces ae3 vlan-tagging
set interfaces ae3 native-vlan-id 2048
set interfaces ae3 mtu 9000
set interfaces ae3 aggregated-ether-options lacp periodic fast
set interfaces ae3 aggregated-ether-options lacp force-up
set interfaces ae3 unit 0 family ethernet-switching interface-mode trunk
set interfaces ae3 unit 0 family ethernet-switching vlan members compute
set interfaces ae3 unit 0 family ethernet-switching vlan members network_mgmt
set interfaces ae3 unit 0 family ethernet-switching vlan members device0
set interfaces ae3 unit 0 family ethernet-switching vlan members device1
set interfaces ae3 unit 0 family ethernet-switching vlan members debug
set interfaces interface-range r340-2 member ge-0/0/8
set interfaces interface-range r340-2 member ge-0/0/9
set interfaces interface-range r340-2 ether-options 802.3ad ae3
- Both servers successfully installed Talos.
- `etcd` failed to bootstrap automatically; I had to bootstrap it manually.
- One of the servers booted fine.
- The other server reported Failed connectivity with error:
- When I reboot the servers, either both of them come up, only one comes up, or both fail.
  One of the servers failed after reboot:
  They flipped after another reboot:
  Both failed after another reboot:
- I see the same issue when using `deviceSelectors`: link
- The servers are provisioned successfully if I leave the networking configuration blank.
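For illustration, a `deviceSelectors`-based variant of the bond patch could look roughly like the sketch below. This is not the exact patch that was tested: the `hardwareAddr` values are placeholders (assumptions, not the real MAC addresses of these servers), and the VLAN list is omitted for brevity. In the Talos machine config, `deviceSelectors` is used under `bond` in place of the `interfaces` list.

```yaml
# Sketch only: bond members are selected by MAC address instead of by name.
# The hardwareAddr values are placeholders (assumptions), not the real
# addresses of eno1/eno2 on these servers.
- op: add
  path: /machine/network/interfaces
  value:
    - interface: bond0
      dhcp: true
      mtu: 9000
      bond:
        deviceSelectors:
          - hardwareAddr: "aa:bb:cc:dd:ee:01" # placeholder for eno1
          - hardwareAddr: "aa:bb:cc:dd:ee:02" # placeholder for eno2
        mode: 802.3ad
        lacpRate: fast
        xmitHashPolicy: layer2+3
      vip:
        ip: 172.18.0.8
```

Selecting by hardware address is one way to rule out interface renaming between boots as a factor, since the bond members no longer depend on the `eno1`/`eno2` names.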
Logs
My full config:
apiVersion: metal.sidero.dev/v1alpha2
kind: Environment
metadata:
  name: default
spec:
  initrd:
    url: http://github.com/talos-systems/talos/releases/download/v1.7.2/initramfs-amd64.xz
  kernel:
    args:
      - console=tty0
      - console=ttyS0
      - consoleblank=0
      - earlyprintk=ttyS0
      - ima_appraise=fix
      - ima_hash=sha512
      - ima_template=ima-ng
      - init_on_alloc=1
      - initrd=initramfs.xz
      - nvme_core.io_timeout=4294967295
      - printk.devkmsg=on
      - pti=on
      - slab_nomerge=
      - talos.platform=metal
    url: http://github.com/talos-systems/talos/releases/download/v1.7.2/vmlinuz-amd64
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: primus-cp
  namespace: default
spec:
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.7.2
      configPatches:
        - op: add
          path: /machine/kubelet/image
          value: ghcr.io/siderolabs/kubelet:v1.28.9
        - op: add
          path: /cluster/etcd/image
          value: gcr.io/etcd-development/etcd:v3.5.13
        - op: add
          path: /cluster/coreDNS
          value:
            disabled: true
        - op: add
          path: /cluster/scheduler/image
          value: registry.k8s.io/kube-scheduler:v1.28.9
        - op: add
          path: /cluster/scheduler/extraArgs
          value:
            bind-address: 0.0.0.0
        - op: add
          path: /cluster/controllerManager/image
          value: registry.k8s.io/kube-controller-manager:v1.28.9
        - op: add
          path: /cluster/controllerManager/extraArgs
          value:
            bind-address: 0.0.0.0
        - op: add
          path: /cluster/apiServer/image
          value: registry.k8s.io/kube-apiserver:v1.28.9
        - op: add
          path: /cluster/proxy/image
          value: registry.k8s.io/kube-proxy:v1.28.9
        - op: add
          path: /machine/features
          value:
            rbac: true
            stableHostname: true
            apidCheckExtKeyUsage: true
            kubernetesTalosAPIAccess:
              enabled: true
              allowedRoles:
                - os:admin
              allowedKubernetesNamespaces:
                - do-tools
        - op: add
          path: /machine/registries/mirrors
          value:
            ghcr.io:
              endpoints:
                - https://artifactory.test.com
        - op: add
          path: /cluster/allowSchedulingOnControlPlanes
          value: true
        - op: add
          path: /cluster/network/cni
          value:
            name: custom
            urls:
              - devops/yaml/flannel/v0.25.1/flannel.yaml
        - op: add
          path: /cluster/apiServer/certSANs
          value:
            - k8s.test.io
            - test
        - op: add
          path: /cluster/extraManifests
          value:
            - devops/yaml/coredns/1.11.1/coredns.yaml
        - op: add
          path: /machine/network/interfaces
          value:
            - dhcp: true
              mtu: 9000
              interface: bond0
              bond:
                interfaces:
                  - eno1
                  - eno2
                mode: 802.3ad
                lacpRate: fast
                xmitHashPolicy: layer2+3
              vip:
                ip: 172.18.0.8
              vlans:
                - vlanId: 3072
                  dhcp: false
                  mtu: 9000
                - vlanId: 3073
                  dhcp: false
                  mtu: 9000
                - vlanId: 3074
                  dhcp: false
                  mtu: 9000
                - vlanId: 3075
                  dhcp: false
                  mtu: 9000
                - vlanId: 3076
                  dhcp: false
                  mtu: 9000
                - vlanId: 3077
                  dhcp: false
                  mtu: 9000
                - vlanId: 3078
                  dhcp: false
                  mtu: 9000
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalMachineTemplate
    name: primus-cp
  replicas: 2
  version: 1.28.9
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: primus
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 172.24.0.0/14
    services:
      cidrBlocks:
        - 172.28.0.0/14
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: primus-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalCluster
    name: primus
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: MetalCluster
metadata:
  name: primus
  namespace: default
spec:
  controlPlaneEndpoint:
    host: 172.18.0.8
    port: 6443
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: MetalMachineTemplate
metadata:
  name: primus-cp
  namespace: default
spec:
  template:
    spec:
      serverClassRef:
        apiVersion: metal.sidero.dev/v1alpha2
        kind: ServerClass
        name: generic
---
apiVersion: metal.sidero.dev/v1alpha2
kind: ServerClass
metadata:
  name: generic
spec:
  configPatches:
    - op: add
      path: /cluster/clusterName
      value: primus
    - op: add
      path: /machine/install/image
      value: ghcr.io/siderolabs/installer:v1.7.2
    - op: add
      path: /machine/logging
      value:
        destinations:
          - endpoint: "udp://127.0.0.1:12335/"
            format: "json_lines"
    - op: add
      path: /machine/kubelet/nodeIP
      value:
        validSubnets:
          - 172.18.0.0/24
    - op: add
      path: /machine/systemDiskEncryption
      value:
        ephemeral:
          provider: luks2
          keys:
            - nodeID: {}
              slot: 0
        state:
          provider: luks2
          keys:
            - nodeID: {}
              slot: 0
    - op: add
      path: /machine/certSANs
      value:
        - primus
    - op: add
      path: /machine/files
      value:
        - path: /etc/ssl/certs/ca-certificates
          permissions: 0644
          op: append
          content: |
            -----BEGIN CERTIFICATE-----
            MIIFUTCCAzmgAwIBAgIRAMxzUVWLQN96FYDcBtOsbHIwDQYJKoZIhvcNAQELBQAw
            NjEXMBUGA1UEChMOQmVya3NoaXJlIEdyZXkxGzAZBgNVBAMTEnZhdWx0LmF3cy5i
            -----END CERTIFICATE-----
        - path: /etc/ssl/certs/ca-certificates
          permissions: 0420
          op: append
          content: |
            -----BEGIN CERTIFICATE-----
            MIIFZzCCA0+gAwIBAgIUQmnJVYRaQsYhJ8LzbaZnkO/xdjowDQYJKoZIhvcNAQEL
            BQAwOzEXMBUGA1UEChMOQmVya3NoaXJlIEdyZXkxDjAMBgNVBAsTBVZhdWx0MRAw
            -----END CERTIFICATE-----
    - op: add
      path: /machine/time
      value: {}
    - op: add
      path: /machine/time/servers
      value:
        - 172.17.0.5
    - op: add
      path: /machine/install/extraKernelArgs
      value:
        - cpufreq.default_governor=performance
    - op: add
      path: /machine/install/diskSelector
      value:
        size: "< 900GB"
Environment
- Talos version: [`1.7.2`]
- Kubernetes version: [`1.28.9`]
- Platform: metal