
wireguard.cali interface does not have IPv4 address

[Open] geotransformer opened this issue 1 year ago • 7 comments

WireGuard interface has no IP address on one node of a 3-node k8s cluster:

ubuntu@k8s-node3:~$ ifconfig wireguard.cali
wireguard.cali: flags=209<UP,POINTOPOINT,RUNNING,NOARP>  mtu 1440
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)
        RX packets 299392534  bytes 133667595056 (133.6 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 263705618  bytes 55068382752 (55.0 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

kubeadm-based k8s cluster:

ubuntu@k8s-node3:~$ kubectl get nodes
NAME        STATUS   ROLES           AGE   VERSION
k8s-node1   Ready    control-plane   40h   v1.26.5
k8s-node2   Ready    control-plane   39h   v1.26.5
k8s-node3   Ready    control-plane   39h   v1.26.5

Pods scheduled:

ubuntu@k8s-node3:~$ kubectl get pods -A | wc -l
362
ubuntu@k8s-node3:~$ kubectl get pods -A -owide | grep k8s-node3 | wc -l
107

Pod subnet:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.26.5
certificatesDir: /data/kubernetes/pki
networking:
  serviceSubnet: 10.152.4.0/23
  podSubnet: 10.152.2.0/23
apiServer:

Expected Behavior

ubuntu@k8s-node1:~$ ifconfig wireguard.cali
wireguard.cali: flags=209<UP,POINTOPOINT,RUNNING,NOARP>  mtu 1440
        inet 10.152.2.66  netmask 255.255.255.255  destination 10.152.2.66

Current Behavior

ubuntu@k8s-node3:~$ ifconfig wireguard.cali
wireguard.cali: flags=209<UP,POINTOPOINT,RUNNING,NOARP>  mtu 1440
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)
        RX packets 299392534  bytes 133667595056 (133.6 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 263705618  bytes 55068382752 (55.0 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
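For anyone comparing notes, a couple of extra checks that confirm the state (a rough sketch; it assumes wireguard-tools is installed on the node, which is not stated in the report above):

# Show only IPv4 addresses on the device; on the broken node this prints nothing
ip -4 addr show dev wireguard.cali

# The WireGuard device itself can still carry peer/key configuration even without an IP
sudo wg show wireguard.cali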

Possible Solution

Steps to Reproduce (for bugs)

  1. Upgrade k8s and the OS in a rolling fashion, one node at a time

Context

Your Environment

  • Calico version: calicoctl v3.24
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
  • Operating System and version: Ubuntu 20.04.6 LTS
  • Link to your project (optional):

geotransformer · Apr 28 '24 12:04

Calico node pod logs for the impacted node:

ubuntu@k8s-node3:~$ kubectl get pods -A -owide | grep k8s-node3 | grep calico
kube-system   calico-node-zbdf8   1/1   Running   1   40h   10.152.1.252   k8s-node3

ubuntu@k8s-node3:~$ date
Sun 28 Apr 2024 12:29:14 PM UTC

ubuntu@k8s-node3:~$ kubectl logs -n kube-system calico-node-zbdf8 | grep -i guard

2024-04-28 12:28:48.242 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="

2024-04-28 12:28:51.607 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"
2024-04-28 12:28:54.351 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node1" public_key:"iVS+PMScWh65pQS2yr0jcV9oPgsd3UbM/SwodOpB8nQ=" interface_ipv4_addr:"10.152.2.66"
2024-04-28 12:28:58.429 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="
2024-04-28 12:29:01.926 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"
2024-04-28 12:29:04.502 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node1" public_key:"iVS+PMScWh65pQS2yr0jcV9oPgsd3UbM/SwodOpB8nQ=" interface_ipv4_addr:"10.152.2.66"
2024-04-28 12:29:08.555 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="
2024-04-28 12:29:12.032 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"
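Note that the updates for k8s-node3 carry a public key but no interface_ipv4_addr, which suggests the address is missing from that node's entry in the datastore, not just from the device. A quick way to cross-check (a sketch only; it assumes calicoctl is configured against the same datastore, and the exact field/annotation names are my guess at where Calico records the WireGuard details):

# Compare what is stored for a healthy node vs the impacted one
calicoctl get node k8s-node1 -o yaml | grep -iA3 wireguard
calicoctl get node k8s-node3 -o yaml | grep -iA3 wireguard

# Node annotations are another place this may show up (Kubernetes datastore installs)
kubectl get node k8s-node3 -o jsonpath='{.metadata.annotations}' | tr ',' '\n' | grep -i wireguard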

geotransformer · Apr 28 '24 12:04

Is the problem on a single node only or across all the nodes?

Upgrade k8s and the OS in a rolling fashion, one node at a time

Could you state in the description what got you to this state? You had a working cluster and then you upgraded k8s and the OS? Is this the first node updated? Is there an incompatibility in wg between the old nodes and the new nodes?

tomastigera · Apr 29 '24 23:04


Is the problem on a single node only or across all the nodes?

Upgrade k8s and the OS in a rolling fashion, one node at a time

Could you state in the description what got you to this state? You had a working cluster and then you upgraded k8s and the OS? Is this the first node updated? Is there an incompatibility in wg between the old nodes and the new nodes?

1> K8s was upgraded from 1.25 to 1.26; Calico did not change, same version 3.24. The impacted node is not always the same node, sometimes node2 and sometimes node3. The issue was observed in about 3-4 out of 100 upgrades.

2> For the k8s upgrade, the node is cordoned, drained, and removed from the k8s cluster. Then the OS is upgraded, and the node is joined back to the k8s cluster with kubeadm join (a rough sketch of this sequence is included after the log excerpt below).

3> If the address is deleted with ip addr del xxx dev wireguard.cali, the IP is restored by Calico itself. Wondering why it cannot recover by itself in the following scenario:

2024-04-28 12:28:48.242 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node3" public_key:"vFqdz4DFUYlAaGzN4O3p7vkFfoxNr+aIY94e48lZ+mQ="

2024-04-28 12:28:51.607 [INFO][84] felix/int_dataplane.go 1946: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"k8s-node2" public_key:"+Ek2nxBsI60WEYfoMdQmAFUZFllR4dzB2yS80yjMDFQ=" interface_ipv4_addr:"10.152.2.131"
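To make points 2> and 3> concrete, here is a rough sketch of the per-node sequence and the manual recovery check mentioned above. It is only illustrative: the node name, the OS upgrade step, and the kubeadm join parameters are placeholders, not values taken from this cluster.

# Rolling upgrade of one node (run from a healthy control-plane node)
kubectl cordon k8s-node3
kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data
kubectl delete node k8s-node3

# On k8s-node3 itself: reset, upgrade the OS, then rejoin as a control-plane node
sudo kubeadm reset -f
# ... OS upgrade / reboot ...
sudo kubeadm join <api-server-endpoint> --token <token> \
    --discovery-token-ca-cert-hash <hash> --control-plane --certificate-key <key>

# Manual recovery check from point 3>: delete the address and watch Felix restore it
sudo ip addr del <wireguard-ip>/32 dev wireguard.cali   # use the inet address shown by 'ip -4 addr'
watch -n2 "ip -4 addr show dev wireguard.cali"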

The following is a capture from trying to reproduce the issue:

node1" public_key:"oB8moC5Qw4tbVnyvjRlEi3abHkpU5k8YCalNqAy49ik=" interface_ipv4_addr:"10.28.2.133" 2024-04-29 22:13:53.354 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"oB8moC5Qw4tbVnyvjRlEi3abHkpU5k8YCalNqAy49ik=" interface_ipv4_addr:"10.28.2.133"

2024-04-29 22:13:58.731 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"oB8moC5Qw4tbVnyvjRlEi3abHkpU5k8YCalNqAy49ik=" interface_ipv4_addr:"10.28.2.133"

2024-04-29 22:17:53.521 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointRemove update from calculation graph msg=hostname:"test-node1"

2024-04-29 22:18:07.680 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"d5eb99Gp3YQYrXeBEWf7P0+QTF7Uof4g3s5dwkwONzU="

2024-04-29 22:18:07.733 [INFO][1258] felix/int_dataplane.go 1680: Received *proto.WireguardEndpointUpdate update from calculation graph msg=hostname:"test-node1" public_key:"d5eb99Gp3YQYrXeBEWf7P0+QTF7Uof4g3s5dwkwONzU=" interface_ipv4_addr:"10.28.2.136"

geotransformer · Apr 30 '24 02:04

OK, so the issue is isolated to individual nodes. Could you share full logs from a node? One possibility is a compatibility issue with k8s 1.26. Calico 3.24 is not really supported anymore; you might need to upgrade.

tomastigera · Apr 30 '24 16:04

@geotransformer may I also ask you to enable debug logging in felix? Set logSeverityScreen to Debug in the default FelixConfiguration: https://docs.tigera.io/calico/latest/operations/troubleshoot/component-logs#configure-felix-log-level
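For reference, a minimal way to do that from the command line (a sketch; it assumes calicoctl is configured, or that the projectcalico.org API is available to kubectl via the Calico API server):

# With calicoctl
calicoctl patch felixconfiguration default --patch '{"spec":{"logSeverityScreen":"Debug"}}'

# Or with kubectl, if the Calico API server is installed
kubectl patch felixconfiguration default --type merge -p '{"spec":{"logSeverityScreen":"Debug"}}'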

coutinhop · Apr 30 '24 16:04

OK, so the issue is isolated to individual nodes. Could you share full logs from a node? One possibility is a compatibility issue with k8s 1.26. Calico 3.24 is not really supported anymore; you might need to upgrade.

We observed the same issue on Calico 3.27.

One thing we would like to share here first. The pod subnet configured in this 3-node cluster is a /23, and we use the default Kubernetes/Calico config, so one node cannot get a /24 CIDR; Kubernetes complains that no CIDR is available for node3. In Calico, I believe IPAM manages the IP blocks and allocation, so this warning or error message does not seem to be a critical issue.
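If it helps to verify that, Calico IPAM (rather than the kube-controller-manager node CIDRs) is what actually hands out pod addresses, and its block usage can be inspected directly. A rough sketch, assuming calicoctl is configured and the pool has the default name:

# Per-pool and per-block usage across the /23
calicoctl ipam show --show-blocks

# The pool definition; blockSize (default /26) controls how the /23 is carved into blocks
calicoctl get ippool default-ipv4-ippool -o yaml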

Also, in our 3-node deployment we have 360+ pods and need 310+ pod IPs. During the upgrade, nodes are cordoned and drained, and pods are created again on the node. @coutinhop is there some race condition in IP recycling and reuse for Calico interfaces like wireguard.cali?

geotransformer · May 01 '24 00:05

@geotransformer may I also ask you to enable debug logging in felix? Set logSeverityScreen to Debug in the default FelixConfiguration: https://docs.tigera.io/calico/latest/operations/troubleshoot/component-logs#configure-felix-log-level

Yes, we will try to enable this in our automation testing
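In case it is useful for that automation, a small check like the following (a sketch only, not something taken from our pipeline) can flag the bad state right after a node rejoins:

#!/usr/bin/env bash
# Fail if wireguard.cali exists on this node but has no IPv4 address.
set -euo pipefail

if ip link show dev wireguard.cali >/dev/null 2>&1; then
    if [ -z "$(ip -4 -o addr show dev wireguard.cali)" ]; then
        echo "WARN: wireguard.cali has no IPv4 address on $(hostname)"
        exit 1
    fi
fi
echo "OK: wireguard.cali is absent or has an IPv4 address on $(hostname)"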

geotransformer · May 02 '24 13:05

@geotransformer any update on this one?

caseydavenport · Jul 30 '24 16:07