
Talos `1.7.2`: bare-metal provisioning succeeds only intermittently when using bonded interfaces

Open • eugene-marchanka opened this issue 9 months ago • 1 comment

Bug Report

Description

I'm using Sidero Metal for provisioning. I have 2 identical Dell R340 servers. Network-wise, each of them has 4 network interfaces:

eno1 - Interface UP
eno2 - Interface UP
enp3s0f0 - Interface Down
enp3s0f1 - Interface Down

My TalosControlPlane network config looks like this:

      - op: add
        path: /machine/network/interfaces
        value:
        - dhcp: true
          mtu: 9000
          interface: bond0
          bond:
            interfaces:
            - eno1
            - eno2
            mode: 802.3ad
            lacpRate: fast
            xmitHashPolicy: layer2+3
          vip:
            ip: 172.18.0.8
          vlans:
          - vlanId: 3072
            dhcp: false
            mtu: 9000
          - vlanId: 3073
            dhcp: false
            mtu: 9000
          - vlanId: 3074
            dhcp: false
            mtu: 9000
          - vlanId: 3075
            dhcp: false
            mtu: 9000
          - vlanId: 3076
            dhcp: false
            mtu: 9000
          - vlanId: 3077
            dhcp: false
            mtu: 9000
          - vlanId: 3078
            dhcp: false
            mtu: 9000

My Juniper EX2300 core switch has ae interfaces configured for both servers:

set interfaces ae2 description r340-1
set interfaces ae2 vlan-tagging
set interfaces ae2 native-vlan-id 2048
set interfaces ae2 mtu 9000
set interfaces ae2 aggregated-ether-options lacp periodic fast
set interfaces ae2 aggregated-ether-options lacp force-up
set interfaces ae2 unit 0 family ethernet-switching interface-mode trunk
set interfaces ae2 unit 0 family ethernet-switching vlan members compute
set interfaces ae2 unit 0 family ethernet-switching vlan members network_mgmt
set interfaces ae2 unit 0 family ethernet-switching vlan members device0
set interfaces ae2 unit 0 family ethernet-switching vlan members device1
set interfaces ae2 unit 0 family ethernet-switching vlan members debug

set interfaces interface-range r340-1 member ge-0/0/12
set interfaces interface-range r340-1 member ge-0/0/13
set interfaces interface-range r340-1 ether-options 802.3ad ae2


set interfaces ae3 description r340-2
set interfaces ae3 vlan-tagging
set interfaces ae3 native-vlan-id 2048
set interfaces ae3 mtu 9000
set interfaces ae3 aggregated-ether-options lacp periodic fast
set interfaces ae3 aggregated-ether-options lacp force-up
set interfaces ae3 unit 0 family ethernet-switching interface-mode trunk
set interfaces ae3 unit 0 family ethernet-switching vlan members compute
set interfaces ae3 unit 0 family ethernet-switching vlan members network_mgmt
set interfaces ae3 unit 0 family ethernet-switching vlan members device0
set interfaces ae3 unit 0 family ethernet-switching vlan members device1
set interfaces ae3 unit 0 family ethernet-switching vlan members debug

set interfaces interface-range r340-2 member ge-0/0/8
set interfaces interface-range r340-2 member ge-0/0/9
set interfaces interface-range r340-2 ether-options 802.3ad ae3

  • Both servers installed Talos successfully.
  • etcd failed to bootstrap automatically; I had to bootstrap it manually.
  • One of the servers booted fine.
  • The other server reported failed connectivity with this error: image
  • When I reboot the servers, the outcome varies: both may come up, only one may come up, or both may fail.

One of the servers failed after reboot: image

They flipped after another reboot: image

Both failed after another reboot: image

  • I see the same issue when using deviceSelectors: link

  • The servers provision successfully if I leave the network configuration blank.
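
For readers unfamiliar with the deviceSelectors form mentioned above, a bond defined that way has roughly the following shape. This is a minimal sketch, not the configuration behind the link: the two `hardwareAddr` values are placeholder MAC addresses, and the VIP and VLAN entries are left out for brevity. In Talos, `bond.deviceSelectors` takes the place of `bond.interfaces`; the two are mutually exclusive.

```yaml
      - op: add
        path: /machine/network/interfaces
        value:
        - interface: bond0
          dhcp: true
          mtu: 9000
          bond:
            # Select bond members by hardware attributes instead of by name.
            # Placeholder MAC addresses (assumption), not values from the servers above.
            deviceSelectors:
            - hardwareAddr: "aa:bb:cc:dd:ee:01"
            - hardwareAddr: "aa:bb:cc:dd:ee:02"
            mode: 802.3ad
            lacpRate: fast
            xmitHashPolicy: layer2+3
```

Selecting by hardware address avoids depending on kernel interface naming, which is the usual reason to try this form when `eno1`/`eno2` are suspected of being unstable.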

Logs

My full config:

apiVersion: metal.sidero.dev/v1alpha2
kind: Environment
metadata:
  name: default
spec:
  initrd:
    url: http://github.com/talos-systems/talos/releases/download/v1.7.2/initramfs-amd64.xz
  kernel:
    args:
    - console=tty0
    - console=ttyS0
    - consoleblank=0
    - earlyprintk=ttyS0
    - ima_appraise=fix
    - ima_hash=sha512
    - ima_template=ima-ng
    - init_on_alloc=1
    - initrd=initramfs.xz
    - nvme_core.io_timeout=4294967295
    - printk.devkmsg=on
    - pti=on
    - slab_nomerge=
    - talos.platform=metal
    url: http://github.com/talos-systems/talos/releases/download/v1.7.2/vmlinuz-amd64

---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: primus-cp
  namespace: default
spec:
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.7.2
      configPatches:
      - op: add
        path: /machine/kubelet/image
        value: ghcr.io/siderolabs/kubelet:v1.28.9
      - op: add
        path: /cluster/etcd/image
        value: gcr.io/etcd-development/etcd:v3.5.13
      - op: add
        path: /cluster/coreDNS
        value:
          disabled: true
      - op: add
        path: /cluster/scheduler/image
        value: registry.k8s.io/kube-scheduler:v1.28.9
      - op: add
        path: /cluster/scheduler/extraArgs
        value:
          bind-address: 0.0.0.0
      - op: add
        path: /cluster/controllerManager/image
        value: registry.k8s.io/kube-controller-manager:v1.28.9
      - op: add
        path: /cluster/controllerManager/extraArgs
        value:
          bind-address: 0.0.0.0
      - op: add
        path: /cluster/apiServer/image
        value: registry.k8s.io/kube-apiserver:v1.28.9
      - op: add
        path: /cluster/proxy/image
        value: registry.k8s.io/kube-proxy:v1.28.9
      - op: add
        path: /machine/features
        value:
          rbac: true
          stableHostname: true
          apidCheckExtKeyUsage: true
          kubernetesTalosAPIAccess:
            enabled: true
            allowedRoles:
              - os:admin
            allowedKubernetesNamespaces:
              - do-tools
      - op: add
        path: /machine/registries/mirrors
        value:
          ghcr.io:
            endpoints:
              - https://artifactory.test.com
      - op: add
        path: /cluster/allowSchedulingOnControlPlanes
        value: true
      - op: add
        path: /cluster/network/cni
        value:
          name: custom
          urls:
          - devops/yaml/flannel/v0.25.1/flannel.yaml
      - op: add
        path: /cluster/apiServer/certSANs
        value:
        - k8s.test.io
        - test
      - op: add
        path: /cluster/extraManifests
        value:
        - devops/yaml/coredns/1.11.1/coredns.yaml
      - op: add
        path: /machine/network/interfaces
        value:
        - dhcp: true
          mtu: 9000
          interface: bond0
          bond:
            interfaces:
            - eno1
            - eno2
            mode: 802.3ad
            lacpRate: fast
            xmitHashPolicy: layer2+3
          vip:
            ip: 172.18.0.8
          vlans:
          - vlanId: 3072
            dhcp: false
            mtu: 9000
          - vlanId: 3073
            dhcp: false
            mtu: 9000
          - vlanId: 3074
            dhcp: false
            mtu: 9000
          - vlanId: 3075
            dhcp: false
            mtu: 9000
          - vlanId: 3076
            dhcp: false
            mtu: 9000
          - vlanId: 3077
            dhcp: false
            mtu: 9000
          - vlanId: 3078
            dhcp: false
            mtu: 9000

  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalMachineTemplate
    name: primus-cp
  replicas: 2
  version: 1.28.9

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: primus
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 172.24.0.0/14
    services:
      cidrBlocks:
      - 172.28.0.0/14
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: primus-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalCluster
    name: primus

---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: MetalCluster
metadata:
  name: primus
  namespace: default
spec:
  controlPlaneEndpoint:
    host: 172.18.0.8
    port: 6443

---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: MetalMachineTemplate
metadata:
  name: primus-cp
  namespace: default
spec:
  template:
    spec:
      serverClassRef:
        apiVersion: metal.sidero.dev/v1alpha2
        kind: ServerClass
        name: generic
---
apiVersion: metal.sidero.dev/v1alpha2
kind: ServerClass
metadata:
  name: generic
spec:
  configPatches:
  - op: add
    path: /cluster/clusterName
    value: primus
  - op: add
    path: /machine/install/image
    value: ghcr.io/siderolabs/installer:v1.7.2
  - op: add
    path: /machine/logging
    value:
      destinations:
      - endpoint: "udp://127.0.0.1:12335/"
        format: "json_lines"
  - op: add
    path: /machine/kubelet/nodeIP
    value:
      validSubnets:
      - 172.18.0.0/24
  - op: add
    path: /machine/systemDiskEncryption
    value:
      ephemeral:
        provider: luks2
        keys:
          - nodeID: {}
            slot: 0
      state:
        provider: luks2
        keys:
          - nodeID: {}
            slot: 0
  - op: add
    path: /machine/certSANs
    value:
    - primus
  - op: add
    path: /machine/files
    value:
      - path: /etc/ssl/certs/ca-certificates
        permissions: 0644
        op: append
        content: |
          -----BEGIN CERTIFICATE-----
          MIIFUTCCAzmgAwIBAgIRAMxzUVWLQN96FYDcBtOsbHIwDQYJKoZIhvcNAQELBQAw
          NjEXMBUGA1UEChMOQmVya3NoaXJlIEdyZXkxGzAZBgNVBAMTEnZhdWx0LmF3cy5i
          -----END CERTIFICATE-----
      - path: /etc/ssl/certs/ca-certificates
        permissions: 0420
        op: append
        content: |
          -----BEGIN CERTIFICATE-----
          MIIFZzCCA0+gAwIBAgIUQmnJVYRaQsYhJ8LzbaZnkO/xdjowDQYJKoZIhvcNAQEL
          BQAwOzEXMBUGA1UEChMOQmVya3NoaXJlIEdyZXkxDjAMBgNVBAsTBVZhdWx0MRAw
          -----END CERTIFICATE-----
  - op: add
    path: /machine/time
    value: {}
  - op: add
    path: /machine/time/servers
    value:
    - 172.17.0.5
  - op: add
    path: /machine/install/extraKernelArgs
    value:
    - cpufreq.default_governor=performance
  - op: add
    path: /machine/install/diskSelector
    value:
      size: "< 900GB"


Environment

  • Talos version: [1.7.2]
  • Kubernetes version: [1.28.9]
  • Platform: metal

eugene-marchanka, May 25 '24 22:05