Talos worker not responding to ARP requests
Bug Report
When I cURL the external IP address of some services I get a response, but for another service in another namespace I get nothing. When I manually add the MAC address to the ARP table it suddenly works, for both cURL and arping.
So somehow the worker doesn't always respond to the who-has ARP request, even though I can see the request arrive on the Talos worker.
Description
To reproduce:
Deploy Talos with this patch, which disables kube-proxy and the default CNI:
---
machine:
  install:
    image: factory.talos.dev/installer/ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515:v1.6.7
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
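For completeness, this is roughly how the patch is applied when generating the machine configs (a sketch; patch.yaml holds the snippet above, and the cluster name and endpoint are illustrative):
# talosctl gen config my-cluster https://10.0.0.35:6443 --config-patch @patch.yaml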
Deploy Cilium with these values:
ipam:
  mode: kubernetes
  operator:
    clusterPoolIPv4PodCIDR: "172.16.0.0/12"
kubeProxyReplacement: true
securityContext:
  capabilities:
    cleanCiliumState:
      - NET_ADMIN
      - SYS_ADMIN
      - SYS_RESOURCE
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
cgroup:
  autoMount:
    enabled: false
  hostRoot: "/sys/fs/cgroup"
k8sServiceHost: localhost
k8sServicePort: 7445
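Cilium is installed with Helm along these lines (a sketch; values.yaml holds the values above, and the chart version matches the Cilium version I tested):
# helm repo add cilium https://helm.cilium.io/
# helm install cilium cilium/cilium --version 1.15.3 -n kube-system -f values.yaml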
Deploy MetalLB with these values:
psp:
  create: true
controller:
  logLevel: debug
speaker:
  logLevel: debug
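MetalLB is likewise installed via Helm (a sketch; values.yaml holds the values above):
# helm repo add metallb https://metallb.github.io/metallb
# helm install metallb metallb/metallb -n metallb-system --create-namespace -f values.yaml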
Apply the IPAddressPool and L2Advertisement:
# k -n metallb-system get ipaddresspools.metallb.io -o yaml
apiVersion: v1
items:
- apiVersion: metallb.io/v1beta1
  kind: IPAddressPool
  metadata:
    creationTimestamp: "2024-03-26T12:06:47Z"
    generation: 7
    labels:
      argocd.argoproj.io/instance: misc
    name: pool-vlan40-private
    namespace: metallb-system
    resourceVersion: "89439"
    uid: efa2054b-394e-4666-b1e0-986d829418ec
  spec:
    addresses:
    - 10.255.254.200-10.255.254.230
    autoAssign: true
    avoidBuggyIPs: false
kind: List
metadata:
  resourceVersion: ""
# k -n metallb-system get l2advertisements.metallb.io -o yaml
apiVersion: v1
items:
- apiVersion: metallb.io/v1beta1
  kind: L2Advertisement
  metadata:
    creationTimestamp: "2024-03-26T12:06:47Z"
    generation: 1
    labels:
      argocd.argoproj.io/instance: misc
    name: example
    namespace: metallb-system
    resourceVersion: "16259"
    uid: 7e5f5121-2a24-45b7-9b9f-1b02d78b9063
kind: List
metadata:
  resourceVersion: ""
Logs
This one doesn't work:
# kubectl -n monitoring get endpointslices kube-prometheus-stack-prometheus-d7d92 -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 10.244.4.206
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: talos-cji-lqz
  targetRef:
    kind: Pod
    name: prometheus-kube-prometheus-stack-prometheus-0
    namespace: monitoring
    uid: 315a8c18-d7f8-41ca-abc7-a9168eb37b41
kind: EndpointSlice
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2024-03-27T09:44:30Z"
  creationTimestamp: "2024-03-26T12:12:45Z"
  generateName: kube-prometheus-stack-prometheus-
  generation: 6
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 57.0.3
    argocd.argoproj.io/instance: kube-prometheus-stack
    chart: kube-prometheus-stack-57.0.3
    endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
    heritage: Helm
    kubernetes.io/service-name: kube-prometheus-stack-prometheus
    release: kube-prometheus-stack
    self-monitor: "true"
  name: kube-prometheus-stack-prometheus-d7d92
  namespace: monitoring
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Service
    name: kube-prometheus-stack-prometheus
    uid: 19cb592f-b6d1-45b9-9b46-9c83e448ceae
  resourceVersion: "410123"
  uid: ec04aa8b-96d1-44dc-91f0-5c5595def2b8
ports:
- appProtocol: http
  name: reloader-web
  port: 8080
  protocol: TCP
- name: http-web
  port: 9090
  protocol: TCP
This one does work:
# kubectl -n monitoring get endpointslices kube-prometheus-stack-grafana-qvr5d -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 10.244.4.27
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: talos-cji-lqz
  targetRef:
    kind: Pod
    name: kube-prometheus-stack-grafana-69d9d5656f-v4jmm
    namespace: monitoring
    uid: 958aeacf-9045-4a7a-b950-2062d2ea07d6
kind: EndpointSlice
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2024-03-27T09:44:20Z"
  creationTimestamp: "2024-03-26T12:12:44Z"
  generateName: kube-prometheus-stack-grafana-
  generation: 6
  labels:
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: grafana
    app.kubernetes.io/version: 10.3.3
    argocd.argoproj.io/instance: kube-prometheus-stack
    endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
    helm.sh/chart: grafana-7.3.7
    kubernetes.io/service-name: kube-prometheus-stack-grafana
  name: kube-prometheus-stack-grafana-qvr5d
  namespace: monitoring
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Service
    name: kube-prometheus-stack-grafana
    uid: ffc29999-c79c-4247-825d-1860ec7150e1
  resourceVersion: "410028"
  uid: 39bff121-a25a-4175-a7dc-d22536d5c6e8
ports:
- name: http-web
  port: 3000
  protocol: TCP
The Grafana service is the one that is working:
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      host: grafana-k8s
      metallb.universe.tf/ip-allocated-from-pool: pool-vlan40-private
    creationTimestamp: "2024-03-26T12:12:43Z"
    labels:
      app.kubernetes.io/instance: kube-prometheus-stack
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: grafana
      app.kubernetes.io/version: 10.3.3
      argocd.argoproj.io/instance: kube-prometheus-stack
      helm.sh/chart: grafana-7.3.7
    name: kube-prometheus-stack-grafana
    namespace: monitoring
    resourceVersion: "89448"
    uid: ffc29999-c79c-4247-825d-1860ec7150e1
  spec:
    allocateLoadBalancerNodePorts: true
    clusterIP: 10.98.6.96
    clusterIPs:
    - 10.98.6.96
    externalTrafficPolicy: Cluster
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: http-web
      nodePort: 31667
      port: 80
      protocol: TCP
      targetPort: 3000
    selector:
      app.kubernetes.io/instance: kube-prometheus-stack
      app.kubernetes.io/name: grafana
    sessionAffinity: None
    type: LoadBalancer
  status:
    loadBalancer:
      ingress:
      - ip: 10.255.254.203
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      host: prometheus
      http_request: auth unless { http_auth(logins) }
      metallb.universe.tf/ip-allocated-from-pool: pool-vlan40-private
    creationTimestamp: "2024-03-26T12:12:43Z"
    labels:
      app: kube-prometheus-stack-prometheus
      app.kubernetes.io/instance: kube-prometheus-stack
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/part-of: kube-prometheus-stack
      app.kubernetes.io/version: 57.0.3
      argocd.argoproj.io/instance: kube-prometheus-stack
      chart: kube-prometheus-stack-57.0.3
      heritage: Helm
      release: kube-prometheus-stack
      self-monitor: "true"
    name: kube-prometheus-stack-prometheus
    namespace: monitoring
    resourceVersion: "89454"
    uid: 19cb592f-b6d1-45b9-9b46-9c83e448ceae
  spec:
    allocateLoadBalancerNodePorts: true
    clusterIP: 10.108.185.153
    clusterIPs:
    - 10.108.185.153
    externalTrafficPolicy: Cluster
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: http-web
      nodePort: 30273
      port: 9090
      protocol: TCP
      targetPort: 9090
    - appProtocol: http
      name: reloader-web
      nodePort: 32008
      port: 8080
      protocol: TCP
      targetPort: reloader-web
    selector:
      app.kubernetes.io/name: prometheus
      operator.prometheus.io/name: kube-prometheus-stack-prometheus
    sessionAffinity: None
    type: LoadBalancer
  status:
    loadBalancer:
      ingress:
      - ip: 10.255.254.205
kind: List
metadata:
  resourceVersion: ""
I see who-has ARP requests coming in, but nothing responds to them. This capture is from the node where the service should be announced:
# talosctl -n 10.0.0.27 pcap -i enxca166464b476 -o - | tcpdump -vvv -r -
reading from file -, link-type EN10MB (Ethernet), snapshot length 4096
13:14:47.985709 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.255.254.205 tell 10.255.254.11, length 44
13:14:48.986766 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.255.254.205 tell 10.255.254.11, length 44
13:14:49.987947 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.255.254.205 tell 10.255.254.11, length 44
So I see ARP requests but no MetalLB speaker is responding.
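To rule out a reply leaving on another interface, the same capture can be narrowed to ARP traffic only (a sketch, reusing the pcap invocation from above):
# talosctl -n 10.0.0.27 pcap -i enxca166464b476 -o - | tcpdump -n -r - arp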
More info:
# k get nodes -o wide --show-labels
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME LABELS
talos-5tn-1ws Ready control-plane 18h v1.28.6 10.0.0.35 <none> Talos (v1.6.7) 6.1.82-talos containerd://1.7.13 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=talos-5tn-1ws,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=
talos-8vr-3ik Ready <none> 18h v1.28.6 10.0.0.40 <none> Talos (v1.6.7) 6.1.82-talos containerd://1.7.13 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=talos-8vr-3ik,kubernetes.io/os=linux
talos-b1o-dbx Ready <none> 18h v1.28.6 10.0.0.36 <none> Talos (v1.6.7) 6.1.82-talos containerd://1.7.13 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=talos-b1o-dbx,kubernetes.io/os=linux
talos-d2l-5q8 Ready control-plane 18h v1.28.6 10.0.0.37 <none> Talos (v1.6.7) 6.1.82-talos containerd://1.7.13 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=talos-d2l-5q8,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=
talos-lwg-std Ready <none> 18h v1.28.6 10.0.0.34 <none> Talos (v1.6.7) 6.1.82-talos containerd://1.7.13 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=talos-lwg-std,kubernetes.io/os=linux
talos-r85-4o3 Ready control-plane 18h v1.28.6 10.0.0.38 <none> Talos (v1.6.7) 6.1.82-talos containerd://1.7.13 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=talos-r85-4o3,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=
Talos doesn't provide a shell, but here are all the addresses on my worker nodes:
# talosctl get address -n 10.0.0.40,10.0.0.34,10.0.0.36
NODE NAMESPACE TYPE ID VERSION ADDRESS LINK
10.0.0.40 network AddressStatus cilium_host/10.244.3.82/32 1 10.244.3.82/32 cilium_host
10.0.0.40 network AddressStatus cilium_host/fe80::1ce3:91ff:fee1:4e56/64 1 fe80::1ce3:91ff:fee1:4e56/64 cilium_host
10.0.0.40 network AddressStatus cilium_net/fe80::e847:6cff:fe4f:72d1/64 2 fe80::e847:6cff:fe4f:72d1/64 cilium_net
10.0.0.40 network AddressStatus cilium_vxlan/fe80::1046:74ff:fe9b:55a9/64 2 fe80::1046:74ff:fe9b:55a9/64 cilium_vxlan
10.0.0.40 network AddressStatus enx362882fb8a2e/10.255.254.25/24 1 10.255.254.25/24 enx362882fb8a2e
10.0.0.40 network AddressStatus enx362882fb8a2e/fe80::3428:82ff:fefb:8a2e/64 2 fe80::3428:82ff:fefb:8a2e/64 enx362882fb8a2e
10.0.0.40 network AddressStatus enx56997788166f/10.0.0.40/24 1 10.0.0.40/24 enx56997788166f
10.0.0.40 network AddressStatus enx56997788166f/fe80::5499:77ff:fe88:166f/64 2 fe80::5499:77ff:fe88:166f/64 enx56997788166f
10.0.0.40 network AddressStatus lo/127.0.0.1/8 1 127.0.0.1/8 lo
10.0.0.40 network AddressStatus lo/::1/128 1 ::1/128 lo
10.0.0.40 network AddressStatus lxc_health/fe80::d87f:acff:feeb:ccc7/64 2 fe80::d87f:acff:feeb:ccc7/64 lxc_health
10.0.0.34 network AddressStatus cilium_host/10.244.5.37/32 1 10.244.5.37/32 cilium_host
10.0.0.34 network AddressStatus cilium_host/fe80::68ef:70ff:fe96:5844/64 2 fe80::68ef:70ff:fe96:5844/64 cilium_host
10.0.0.34 network AddressStatus cilium_net/fe80::3018:74ff:fea2:2554/64 2 fe80::3018:74ff:fea2:2554/64 cilium_net
10.0.0.34 network AddressStatus cilium_vxlan/fe80::14ee:64ff:fe9c:21a4/64 2 fe80::14ee:64ff:fe9c:21a4/64 cilium_vxlan
10.0.0.34 network AddressStatus enxbadd30a650a8/10.255.254.24/24 1 10.255.254.24/24 enxbadd30a650a8
10.0.0.34 network AddressStatus enxbadd30a650a8/fe80::b8dd:30ff:fea6:50a8/64 2 fe80::b8dd:30ff:fea6:50a8/64 enxbadd30a650a8
10.0.0.34 network AddressStatus enxce72d4661dc7/10.0.0.34/24 1 10.0.0.34/24 enxce72d4661dc7
10.0.0.34 network AddressStatus enxce72d4661dc7/fe80::cc72:d4ff:fe66:1dc7/64 2 fe80::cc72:d4ff:fe66:1dc7/64 enxce72d4661dc7
10.0.0.34 network AddressStatus lo/127.0.0.1/8 1 127.0.0.1/8 lo
10.0.0.34 network AddressStatus lo/::1/128 1 ::1/128 lo
10.0.0.34 network AddressStatus lxc_health/fe80::b406:f3ff:fe91:8cdf/64 2 fe80::b406:f3ff:fe91:8cdf/64 lxc_health
10.0.0.36 network AddressStatus cilium_host/10.244.4.67/32 1 10.244.4.67/32 cilium_host
10.0.0.36 network AddressStatus cilium_host/fe80::c847:fbff:fe72:c6b4/64 1 fe80::c847:fbff:fe72:c6b4/64 cilium_host
10.0.0.36 network AddressStatus cilium_net/fe80::2c11:aeff:feb4:bd08/64 2 fe80::2c11:aeff:feb4:bd08/64 cilium_net
10.0.0.36 network AddressStatus cilium_vxlan/fe80::8058:15ff:fe2a:304e/64 2 fe80::8058:15ff:fe2a:304e/64 cilium_vxlan
10.0.0.36 network AddressStatus enx1adc8919626d/10.255.254.23/24 1 10.255.254.23/24 enx1adc8919626d
10.0.0.36 network AddressStatus enx1adc8919626d/fe80::18dc:89ff:fe19:626d/64 2 fe80::18dc:89ff:fe19:626d/64 enx1adc8919626d
10.0.0.36 network AddressStatus enx3e800ccafb16/10.0.0.36/24 1 10.0.0.36/24 enx3e800ccafb16
10.0.0.36 network AddressStatus enx3e800ccafb16/fe80::3c80:cff:feca:fb16/64 2 fe80::3c80:cff:feca:fb16/64 enx3e800ccafb16
10.0.0.36 network AddressStatus lo/127.0.0.1/8 1 127.0.0.1/8 lo
10.0.0.36 network AddressStatus lo/::1/128 1 ::1/128 lo
10.0.0.36 network AddressStatus lxc0d0f6780735f/fe80::dca5:5bff:fe3e:5672/64 2 fe80::dca5:5bff:fe3e:5672/64 lxc0d0f6780735f
10.0.0.36 network AddressStatus lxc10bf2ddab488/fe80::b89c:8aff:fe0f:101b/64 2 fe80::b89c:8aff:fe0f:101b/64 lxc10bf2ddab488
10.0.0.36 network AddressStatus lxc4388096d77f4/fe80::8850:e5ff:fe7b:ae88/64 2 fe80::8850:e5ff:fe7b:ae88/64 lxc4388096d77f4
10.0.0.36 network AddressStatus lxc4c1e58c87b3d/fe80::103c:d9ff:fe90:ebb/64 2 fe80::103c:d9ff:fe90:ebb/64 lxc4c1e58c87b3d
10.0.0.36 network AddressStatus lxc518b07161ee3/fe80::40b2:eaff:fe42:f0ab/64 2 fe80::40b2:eaff:fe42:f0ab/64 lxc518b07161ee3
10.0.0.36 network AddressStatus lxc54c64a0cb1d2/fe80::455:52ff:fe69:b00e/64 2 fe80::455:52ff:fe69:b00e/64 lxc54c64a0cb1d2
10.0.0.36 network AddressStatus lxc77ab8ddbfaab/fe80::d8fa:2cff:fec5:d854/64 2 fe80::d8fa:2cff:fec5:d854/64 lxc77ab8ddbfaab
10.0.0.36 network AddressStatus lxc784733a45b21/fe80::d8c5:4bff:fe19:e68a/64 2 fe80::d8c5:4bff:fe19:e68a/64 lxc784733a45b21
10.0.0.36 network AddressStatus lxc8a1aff154637/fe80::f421:ccff:fe0b:a469/64 2 fe80::f421:ccff:fe0b:a469/64 lxc8a1aff154637
10.0.0.36 network AddressStatus lxc9722e215666a/fe80::cc24:50ff:fe43:bd44/64 2 fe80::cc24:50ff:fe43:bd44/64 lxc9722e215666a
10.0.0.36 network AddressStatus lxc_health/fe80::85f:23ff:fe38:ae1d/64 2 fe80::85f:23ff:fe38:ae1d/64 lxc_health
10.0.0.36 network AddressStatus lxcb57cf2e6c4d6/fe80::98a5:29ff:fed8:29df/64 2 fe80::98a5:29ff:fed8:29df/64 lxcb57cf2e6c4d6
10.0.0.36 network AddressStatus lxcfe0a0b35e635/fe80::ccb9:28ff:fe02:1581/64 2 fe80::ccb9:28ff:fe02:1581/64 lxcfe0a0b35e635
10.0.0.36 network AddressStatus lxcfe59aa50425d/fe80::14e9:2dff:fe78:dcb2/64 2 fe80::14e9:2dff:fe78:dcb2/64 lxcfe59aa50425d
10.0.0.36 network AddressStatus lxcfec1bb24f245/fe80::58d3:77ff:fe5e:d09c/64 2 fe80::58d3:77ff:fe5e:d09c/64 lxcfec1bb24f245
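To confirm a speaker pod is actually scheduled on each of these workers, pod placement can be checked like this (a sketch; the component label is an assumption based on the chart defaults):
# kubectl -n metallb-system get pods -o wide
# kubectl -n metallb-system get pods -l app.kubernetes.io/component=speaker -o wide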
The Services no longer show any events, so here is the same Service again after recreating it:
# k -n monitoring describe svc kube-prometheus-stack-prometheus
Name:                     kube-prometheus-stack-prometheus
Namespace:                monitoring
Labels:                   app=kube-prometheus-stack-prometheus
                          app.kubernetes.io/instance=kube-prometheus-stack
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/part-of=kube-prometheus-stack
                          app.kubernetes.io/version=57.1.1
                          chart=kube-prometheus-stack-57.1.1
                          heritage=Helm
                          release=kube-prometheus-stack
                          self-monitor=true
Annotations:              meta.helm.sh/release-name: kube-prometheus-stack
                          meta.helm.sh/release-namespace: monitoring
                          metallb.universe.tf/ip-allocated-from-pool: pool-vlan40-private
Selector:                 app.kubernetes.io/name=prometheus,operator.prometheus.io/name=kube-prometheus-stack-prometheus
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.108.61.29
IPs:                      10.108.61.29
LoadBalancer Ingress:     10.255.254.213
Port:                     http-web  9090/TCP
TargetPort:               9090/TCP
NodePort:                 http-web  30991/TCP
Endpoints:                10.244.4.3:9090
Port:                     reloader-web  8080/TCP
TargetPort:               reloader-web/TCP
NodePort:                 reloader-web  32131/TCP
Endpoints:                10.244.4.3:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason        Age   From                Message
  ----    ------        ----  ----                -------
  Normal  IPAllocated   25s   metallb-controller  Assigned IP ["10.255.254.213"]
  Normal  nodeAssigned  25s   metallb-speaker     announcing from node "talos-5tn-1ws" with protocol "layer2"
# curl 10.255.254.213:9093
curl: (7) Failed to connect to 10.255.254.213 port 9093 after 3072 ms: Couldn't connect to server
# curl 10.255.254.213:9093
curl: (7) Failed to connect to 10.255.254.213 port 9093 after 1663 ms: Couldn't connect to server
# arping -I ens20 10.255.254.213
ARPING 10.255.254.213
Timeout
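The speaker logs should show whether the announcing node thinks it owns the IP (a sketch; the label selector is an assumption based on the chart defaults):
# kubectl -n metallb-system logs -l app.kubernetes.io/component=speaker --since=10m | grep 10.255.254.213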
Environment
- Talos version: v1.6.7
- Kubernetes version: v1.29.3 / v1.28.6
- Platform: Proxmox
- MetalLB version: 0.14.3
I've tried both Cilium 1.15.3 and the default Flannel/kube-proxy setup that ships with Talos.
Talos doesn't block ARP requests/responses, but in general, for the Linux kernel to respond to ARP requests, the address should be assigned to the host.
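For example, whether the announced IP is present on a node can be checked against the same address list as above (a sketch; note that with MetalLB in layer-2 mode the speaker answers ARP from user space, so the LoadBalancer IP normally won't appear here):
# talosctl get address -n 10.0.0.35 | grep 10.255.254.213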