
Pods cannot ping each other in multi-host scenario - failed to add vxlanRoute (XXX -> X.Y.0.0): invalid argument

Open senwangrockets opened this issue 7 years ago • 20 comments

Pods on different hosts cannot ping each other. Flannel logs are below:

I1018 17:58:53.498781       1 main.go:470] Determining IP address of default interface
I1018 17:58:53.499196       1 main.go:483] Using interface with name eth0 and address 172.28.249.156
I1018 17:58:53.499243       1 main.go:500] Defaulting external address to interface address (172.28.249.156)
I1018 17:58:53.517275       1 kube.go:130] Waiting 10m0s for node controller to sync
I1018 17:58:53.517332       1 kube.go:283] Starting kube subnet manager
I1018 17:58:54.517591       1 kube.go:137] Node controller sync successful
I1018 17:58:54.517652       1 main.go:235] Created subnet manager: Kubernetes Subnet Manager - scarif-admin-2
I1018 17:58:54.517661       1 main.go:238] Installing signal handlers
I1018 17:58:54.517821       1 main.go:348] Found network config - Backend type: vxlan
I1018 17:58:54.517912       1 vxlan.go:119] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
I1018 17:58:54.573370       1 main.go:295] Wrote subnet file to /run/flannel/subnet.env
I1018 17:58:54.573408       1 main.go:299] Running backend.
I1018 17:58:54.573427       1 main.go:317] Waiting for all goroutines to exit
I1018 17:58:54.573496       1 vxlan_network.go:56] watching for new subnet leases
E1018 17:58:54.573780       1 vxlan_network.go:158] failed to add vxlanRoute (172.16.0.0/24 -> 172.16.0.0): invalid argument
I1018 17:58:54.577620       1 ipmasq.go:75] Some iptables rules are missing; deleting and recreating rules
I1018 17:58:54.577673       1 ipmasq.go:97] Deleting iptables rule: -s 172.16.0.0/16 -d 172.16.0.0/16 -j RETURN
I1018 17:58:54.579324       1 ipmasq.go:97] Deleting iptables rule: -s 172.16.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1018 17:58:54.580870       1 ipmasq.go:97] Deleting iptables rule: ! -s 172.16.0.0/16 -d 172.16.1.0/24 -j RETURN
I1018 17:58:54.582349       1 ipmasq.go:97] Deleting iptables rule: ! -s 172.16.0.0/16 -d 172.16.0.0/16 -j MASQUERADE
I1018 17:58:54.583900       1 ipmasq.go:85] Adding iptables rule: -s 172.16.0.0/16 -d 172.16.0.0/16 -j RETURN
I1018 17:58:54.587553       1 ipmasq.go:85] Adding iptables rule: -s 172.16.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1018 17:58:54.591290       1 ipmasq.go:85] Adding iptables rule: ! -s 172.16.0.0/16 -d 172.16.1.0/24 -j RETURN
I1018 17:58:54.595032       1 ipmasq.go:85] Adding iptables rule: ! -s 172.16.0.0/16 -d 172.16.0.0/16 -j MASQUERADE

Your Environment

  • Flannel version: 0.9
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version:
  • Kubernetes version (if used): 1.8
  • Operating System and version: CentOS 7.3, Docker 17.06

senwangrockets avatar Oct 18 '17 18:10 senwangrockets

What I think is interesting is "E1018 17:58:54.573780 1 vxlan_network.go:158] failed to add vxlanRoute (172.16.0.0/24 -> 172.16.0.0): invalid argument"

senwangrockets avatar Oct 19 '17 13:10 senwangrockets

Yes, that line is the smoking gun. What other nodes do you have? Can you output the flannel annotations you have on your nodes (something like kubectl get nodes -o yaml | grep flannel.alpha)?

Somehow, I think one of your nodes has a PublicIP of 172.16.0.0, which it shouldn't. The 172.16/16 range should be reserved for the vxlan network.
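
A minimal check sketch (assuming kubectl access and the standard flannel annotation keys), printing each node name next to its public-ip annotation:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.flannel\.alpha\.coreos\.com/public-ip}{"\n"}{end}'

None of those addresses should fall inside the flannel network range (172.16.0.0/16 here); they should be the nodes' real host IPs.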

tomdee avatar Oct 20 '17 19:10 tomdee

I have a similar issue, same versions of flannel, k8s. Using vxlan, flannel is up and running, no errors in the logs (not even the error above).

kubeadm 1.8.1
k8s 1.8.0
flannel 0.9
ubuntu 16.04
docker 17.03ce

I've tried combinations of k8s as far back as 1.6 and flannel as far back as 0.8, all with the same results.

I'm able to connect pod <-> pod and host <-> pod as long as the pods are on that host. All hosts can communicate with each other without issues. I've spent almost a month fiddling with iptables, routes, etc. and cannot figure this out. I'm seeing traffic via tcpdump on the cni0 bridge, but my pods aren't getting it. IIRC, last night I was using iptstate and was seeing UDP traffic on the bridge when I expected TCP. Maybe this is the issue? It's also possible I was seeing something else...

Should I open another ticket, or piggy back on this one?

camflan avatar Oct 24 '17 16:10 camflan

I'm running into the same issue it seems.

I1026 22:38:06.797811     208 vxlan_network.go:56] watching for new subnet leases
I1026 22:38:06.800429     208 ipmasq.go:75] Some iptables rules are missing; deleting and recreating rules
I1026 22:38:06.800450     208 ipmasq.go:97] Deleting iptables rule: -s 172.17.0.0/16 -d 172.17.0.0/16 -j RETURN
I1026 22:38:06.801507     208 ipmasq.go:97] Deleting iptables rule: -s 172.17.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1026 22:38:06.802527     208 ipmasq.go:97] Deleting iptables rule: ! -s 172.17.0.0/16 -d 172.17.9.0/24 -j RETURN
I1026 22:38:06.803535     208 ipmasq.go:97] Deleting iptables rule: ! -s 172.17.0.0/16 -d 172.17.0.0/16 -j MASQUERADE
I1026 22:38:06.804543     208 ipmasq.go:85] Adding iptables rule: -s 172.17.0.0/16 -d 172.17.0.0/16 -j RETURN
I1026 22:38:06.806706     208 ipmasq.go:85] Adding iptables rule: -s 172.17.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1026 22:38:06.808932     208 ipmasq.go:85] Adding iptables rule: ! -s 172.17.0.0/16 -d 172.17.9.0/24 -j RETURN
I1026 22:38:06.811148     208 ipmasq.go:85] Adding iptables rule: ! -s 172.17.0.0/16 -d 172.17.0.0/16 -j MASQUERADE
E1026 22:38:11.064786     208 vxlan_network.go:158] failed to add vxlanRoute (172.17.0.0/24 -> 172.17.0.0): invalid argument
E1027 02:51:24.265565     208 vxlan_network.go:158] failed to add vxlanRoute (172.17.0.0/24 -> 172.17.0.0): invalid argument

@tomdee none of my nodes have that as the public ip annotation (they're all correct).

jhorwit2 avatar Oct 27 '17 03:10 jhorwit2

I don't see a route for 172.17.0.0/24 on any of my hosts.

$ ip route
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.17.1.0/24 via 172.17.1.0 dev flannel.1 onlink
172.17.2.0/24 via 172.17.2.0 dev flannel.1 onlink
172.17.3.0/24 via 172.17.3.0 dev flannel.1 onlink
172.17.4.0/24 via 172.17.4.0 dev flannel.1 onlink
172.17.5.0/24 via 172.17.5.0 dev flannel.1 onlink
172.17.6.0/24 via 172.17.6.0 dev flannel.1 onlink
172.17.7.0/24 via 172.17.7.0 dev flannel.1 onlink
172.17.8.0/24 via 172.17.8.0 dev flannel.1 onlink
172.17.9.2 dev cali299270d87b6 scope link
172.17.9.3 dev calib63aee49779 scope link
172.17.9.4 dev cali12d4a061371 scope link
$ arp -a
...
? (172.17.0.0) at <incomplete> on flannel.1
...

Flannel logs

I1027 12:53:29.439503     166 vxlan_network.go:138] adding subnet: 172.17.0.0/24 PublicIP: 10.65.27.18 VtepMAC: 46:ee:d0:82:55:a4
I1027 12:53:29.439524     166 device.go:179] calling AddARP: 172.17.0.0, 46:ee:d0:82:55:a4
I1027 12:53:29.439591     166 device.go:156] calling AddFDB: <hostip>, 46:ee:d0:82:55:a4
E1027 12:53:29.439668     166 vxlan_network.go:158] failed to add vxlanRoute (172.17.0.0/24 -> 172.17.0.0): invalid argument
I1027 12:53:29.439706     166 device.go:190] calling DelARP: 172.17.0.0, 46:ee:d0:82:55:a4
I1027 12:53:29.439751     166 device.go:168] calling DelFDB: <hostip>, 46:ee:d0:82:55:a4

jhorwit2 avatar Oct 27 '17 03:10 jhorwit2

I had this error too when transitioning from 1.7.5 to 1.8.2. A reboot solved this error for me. (For completeness: prior to this I deleted the fstab swap entry because kubelet requires that the system doesn't swap. Not sure if this is related.)
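
For reference, a hedged sketch of that swap change as commands (run as root; the sed pattern assumes a standard fstab layout):

swapoff -a                              # disable swap immediately
sed -i '/ swap / s/^/#/' /etc/fstab     # comment out the swap line so it stays off after reboot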

DominicDV avatar Oct 27 '17 14:10 DominicDV

@camflan please open a different issue. I suspect you just need "iptables -P FORWARD ACCEPT"
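
A quick check-and-fix sketch (run as root on each node); if the FORWARD chain's policy is DROP, forwarded pod traffic is silently discarded:

iptables -S FORWARD | head -n1    # first line is the chain policy, e.g. "-P FORWARD DROP"
iptables -P FORWARD ACCEPT        # set the policy back to ACCEPT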

tomdee avatar Nov 04 '17 00:11 tomdee

@jhorwit2 @senwangrockets I think the problem could be that you have the same IP range configured for your Docker bridge as you do for flannel. If you're using kubeadm, did you specify --pod-network-cidr 10.244.0.0/16?
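
A sketch of a non-overlapping setup (10.244.0.0/16 matches the stock kube-flannel manifest; the docker0 range below is just an assumed example):

# At cluster init time, give pods the same range that flannel's net-conf.json uses:
kubeadm init --pod-network-cidr=10.244.0.0/16
# The flannel ConfigMap's net-conf.json should carry the same range:
#   { "Network": "10.244.0.0/16", "Backend": { "Type": "vxlan" } }
# Keep the Docker bridge out of that range, e.g. in /etc/docker/daemon.json:
#   { "bip": "172.18.0.1/24" }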

tomdee avatar Nov 04 '17 00:11 tomdee

@tomdee that was my issue. Sorry I forgot to post after I realized that.

jhorwit2 avatar Nov 04 '17 01:11 jhorwit2

@tomdee Hi Tom,

I initialised my cluster with the same kubeadm command, kubeadm init --pod-network-cidr 10.244.0.0/16, but I still see errors in the flannel pods:

E1210 07:10:45.198903 1 vxlan_network.go:158] failed to add vxlanRoute (10.244.2.0/24 -> 10.244.2.0): invalid argument

I have a 4-host cluster; 2 of the hosts work fine, but the other 2 fail to schedule containers.

The pods are always stuck in the "ContainerCreating" state.

The errors I see are:

Dec 10 01:39:14 kongapi-poc-db1 kubelet: E1210 01:39:14.554032   58034 cni.go:250] Error while adding to cni network: "cni0" already has an IP address different from 10.244.3.1/24
Dec 10 01:39:14 kongapi-poc-db1 kernel: cni0: port 1(veth7b12c96f) entered disabled state
Dec 10 01:39:14 kongapi-poc-db1 kernel: device veth7b12c96f left promiscuous mode
Dec 10 01:39:14 kongapi-poc-db1 kernel: cni0: port 1(veth7b12c96f) entered disabled state
Dec 10 01:39:14 kongapi-poc-db1 NetworkManager[702]: <info>  [1512898754.6477] device (veth7b12c96f): released from master device cni0
Dec 10 01:39:14 kongapi-poc-db1 kubelet: E1210 01:39:14.655974   58034 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "tomcat-d6b5b9647-prq9w_tomcat" network: "cni0" already has an IP address different from 10.244.3.1/24

kumarganesh2814 avatar Dec 10 '17 09:12 kumarganesh2814

Having the same problem. 4 nodes: 2 masters and 2 workers. The .167 and .168 nodes are the workers, and .167 is the one that's having issues adding the route.

Output of: kubectl get nodes -o yaml |grep flannel.alpha

      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"d2:28:18:cd:1d:82"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.165
      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"b6:67:12:1c:d9:c4"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.166
      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"aa:e0:31:6e:d1:ef"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.167
      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"16:13:d5:7c:c5:e2"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.168

eroji avatar Feb 09 '18 01:02 eroji

Are the invalid gateway addresses treated as multicast addresses by Linux? Flannel's own subnet allocation skips the multicast addresses (https://github.com/coreos/flannel/blob/master/subnet/config.go#L86-L93), but the podCidr allocated by the controller manager does not skip the first subnet.

@tomdee

BSWANG avatar Mar 01 '18 12:03 BSWANG

(Quoting kumarganesh2814's comment above about the vxlanRoute "invalid argument" error and the "cni0" already has an IP address different from 10.244.3.1/24 kubelet errors.)

I am not sure if this will help, but you might want to delete all the network/bridge devices before initializing k8s again. I had similar issues, but I destroyed the VMs and created new ones, which resolved my issue. However, the issues might not be the same.

After reading the flannel documentation, it was not obvious to me that flannel works with one CIDR only. But after the change things are much better, although with other issues.
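
A hedged sketch of that cleanup on an affected node (run as root; the device names are the ones seen in this thread, and /run/flannel/subnet.env is the file from the log above):

ip link set cni0 down && ip link delete cni0
ip link set flannel.1 down && ip link delete flannel.1
rm -f /run/flannel/subnet.env
systemctl restart kubelet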

nabheet avatar Dec 03 '18 18:12 nabheet

@senwangrockets @kumarganesh2814, I have the same problem. Have you solved it?

leogoing avatar Jul 22 '19 09:07 leogoing

I got the same problem; here is how I resolved it. I have a 1-master, 2-worker setup, all of them VMs with fixed IPs and hostnames in my local area network. The master and one worker node were fine; one worker node had this problem.

When I see something like vxlan_network.go:158] failed to add vxlanRoute (10.244.2.0/24 -> 10.244.2.0): invalid argument, I log onto that machine and check the IP address of cni0; it could be a different address. You can delete the interface and let the cluster regenerate it. But in my case the problem was that the flannel.1 interface was never created.

So I deleted the node, manually deleted the associated pods from the master, ran kubeadm reset on the problematic worker node, and rejoined, but flannel.1 never appeared. In the end, I deleted the node from the master, did a reset, restarted the VM, and joined the master as normal; flannel.1 appeared. Then I did a deployment from the master, and on the worker node cni0 and the veth interfaces appeared.

TL;DR (not sure whether it will work for you): delete the worker node from the master, run kubeadm reset on the worker, clean up, restart the VM, and join the master node as normal; see the sketch below.
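
A hedged sketch of that recipe; the node name, master address, token, and hash are placeholders:

kubectl delete node <worker-name>      # on the master
kubeadm reset -f                       # on the worker: tears down kubelet state and CNI config
reboot                                 # on the worker
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
ip -d link show flannel.1              # after rejoining, flannel.1 should exist again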

Voxis avatar Oct 31 '20 01:10 Voxis

I also faced this problem. It was because the network interfaces that flanneld was using could not reach each other; I pointed flannel at another network interface and that solved it.
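
A hedged sketch of how that can be done with flanneld's --iface flag (the DaemonSet name and the interface name eth1 are assumptions; adjust to your manifest):

kubectl -n kube-system edit daemonset kube-flannel-ds
# then add to the kube-flannel container's args:
#   - --iface=eth1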

Queetinliu avatar Aug 03 '21 04:08 Queetinliu

Mine is weird: flannel.alpha.coreos.com/public-ip: 10.0.3.15 is on my master, and now my master cannot ping the other nodes' flannel addresses. What actually happened here, and how do I edit the flannel.alpha annotation on my master?

kubectl get nodes -o yaml |grep flannel.alpha

      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"16:cb:5c:78:57:cb"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 192.168.14.3
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"7e:1e:e8:f6:8f:77"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 192.168.14.4
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"06:cd:6a:ba:6b:54"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.0.3.15
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"96:71:0e:48:52:4d"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 192.168.14.2

rthamrin avatar Jan 27 '22 08:01 rthamrin

Check whether flannel.1's subnet conflicts with docker0's IP; if it conflicts, change the subnet's IP range.
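
A quick check sketch (run on each node); a conflict means both devices sit in the same range:

ip -4 addr show docker0 | grep inet
ip -4 addr show flannel.1 | grep inet
ip route | grep -E 'docker0|flannel.1'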

dale1202 avatar Mar 20 '22 11:03 dale1202

Check whether flannel.1's subnet conflicts with docker0's IP; if it conflicts, change the subnet's IP range.

Sorry, who is your answer directed to?

rthamrin avatar Mar 21 '22 00:03 rthamrin

@rthamrin I was responding to this question: "failed to add vxlanRoute (10.244.2.0/24 -> 10.244.2.0): invalid argument"

dale1202 avatar Mar 21 '22 01:03 dale1202

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 25 '23 20:01 stale[bot]