
subnets: consider backend data during watch reset

Open knisbet opened this issue 4 years ago • 3 comments

This fixes an issue we encountered where, if an etcd watch is reset, flannel can miss a MAC address change when using the vxlan backend. With this change, a subnet lease is considered equal during a watch reset only if its backend data is also unchanged.
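To illustrate the idea for reviewers, here is a minimal, self-contained sketch of a watch-reset diff that compares backend data as well as the subnet. It is not the actual flannel code: the `lease` and `event` types and the `sameLease` and `resetDiff` functions are hypothetical simplifications, the backend data is compared as raw bytes (an assumption), the "old" MAC address is made up, and the removed-before-added ordering is just a choice made for the sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// lease is a hypothetical, simplified stand-in for a subnet lease.
type lease struct {
	Subnet      string // e.g. "100.96.65.0/24"
	PublicIP    string // public IP of the node holding the lease
	BackendData []byte // raw backend JSON, e.g. {"VtepMAC":"..."}
}

// event is a hypothetical change notification produced by a watch reset.
type event struct {
	Kind  string // "added" or "removed"
	Lease lease
}

// sameLease reports whether two leases should be treated as unchanged.
// Comparing only Subnet would hide a MAC address change carried in BackendData.
func sameLease(a, b lease) bool {
	return a.Subnet == b.Subnet &&
		a.PublicIP == b.PublicIP &&
		bytes.Equal(a.BackendData, b.BackendData)
}

// resetDiff compares the cached leases with a fresh snapshot taken after a
// watch reset. Cached leases with no identical match are reported as removed,
// and snapshot leases with no identical match are reported as added.
func resetDiff(cached, fresh []lease) []event {
	matched := make([]bool, len(cached))
	var added []event
	for _, nl := range fresh {
		found := false
		for i, ol := range cached {
			if sameLease(ol, nl) {
				matched[i] = true
				found = true
				break
			}
		}
		if !found {
			added = append(added, event{Kind: "added", Lease: nl})
		}
	}
	var events []event
	for i, ol := range cached {
		if !matched[i] {
			events = append(events, event{Kind: "removed", Lease: ol})
		}
	}
	return append(events, added...)
}

func main() {
	// Cached view from before the watch was lost (the MAC here is made up).
	cached := []lease{{
		Subnet:      "100.96.65.0/24",
		PublicIP:    "10.162.0.6",
		BackendData: []byte(`{"VtepMAC":"aa:aa:aa:aa:aa:01"}`),
	}}
	// Fresh snapshot: same subnet and IP, but the node came back with a new MAC.
	fresh := []lease{{
		Subnet:      "100.96.65.0/24",
		PublicIP:    "10.162.0.6",
		BackendData: []byte(`{"VtepMAC":"5a:74:8f:aa:32:9f"}`),
	}}
	for _, ev := range resetDiff(cached, fresh) {
		fmt.Printf("%s %s %s\n", ev.Kind, ev.Lease.Subnet, ev.Lease.BackendData)
	}
}
```

In this sketch, a comparison on `Subnet` alone would treat the two leases as identical and emit no events, so the new MAC address would never reach the backend.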

Type of fix: Bug fix

Testing completed: make test

Todos

  • [x] I'm still setting up tests to trigger the specific scenario within our clusters, so I'm submitting this ahead of that testing in case the change stands out as potentially incorrect to a reviewer.

Edit: manual testing of the fix under triggered conditions is working. Steps to reproduce:

  1. Use iptables to block connectivity to etcd: iptables -I INPUT 1 -p tcp --source-port 2379 -j DROP && iptables -I OUTPUT 1 -p tcp --destination-port 2379 -j DROP
  2. Restart flannel on another cluster node so that its MAC address is reset (delete flannel.1 or reboot the node)
  3. Generate etcd traffic to advance the history window
  4. Remove the iptables rules and observe the reconnection: iptables -D INPUT 1 && iptables -D OUTPUT 1

Observed Results:

Feb 03 20:57:15 kevin-test5 flanneld[29475]: E0203 20:57:15.959398   29475 watch.go:44] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: read tcp 127.0.0.1:58562->127.0.0.1:2379: read: connection timed out
Feb 03 20:57:16 kevin-test5 flanneld[29475]: W0203 20:57:16.968459   29475 local_manager.go:345] Watch of subnet leases failed because etcd index outside history window
Feb 03 20:57:16 kevin-test5 flanneld[29475]: I0203 20:57:16.971181   29475 vxlan_network.go:138] adding subnet: 100.96.65.0/24 PublicIP: 10.162.0.6 VtepMAC: 5a:74:8f:aa:32:9f
Feb 03 20:57:16 kevin-test5 flanneld[29475]: I0203 20:57:16.971415   29475 vxlan_network.go:138] adding subnet: 100.96.74.0/24 PublicIP: 10.162.0.5 VtepMAC: 7e:09:75:b5:22:92
Feb 03 20:57:16 kevin-test5 flanneld[29475]: I0203 20:57:16.971502   29475 vxlan_network.go:138] adding subnet: 100.96.53.0/24 PublicIP: 10.162.0.7 VtepMAC: 8a:2b:5a:10:09:58
Feb 03 20:57:36 kevin-test5 flanneld[29475]: E0203 20:57:36.439379   29475 watch.go:176] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: read tcp 127.0.0.1:58560->127.0.0.1:2379: read: connection timed out
Feb 03 20:57:37 kevin-test5 flanneld[29475]: W0203 20:57:37.448334   29475 local_manager.go:317] Watch of subnet leases failed because etcd index outside history window
Feb 03 20:57:37 kevin-test5 flanneld[29475]: I0203 20:57:37.451540   29475 main.go:418] Waiting for 22h3m40.997388723s to renew lease
  • In the above example I restarted 3 nodes, matching the three "adding subnet" events in the log.

knisbet avatar Jan 27 '20 20:01 knisbet

@rajatchopra take a peek and try and reproduce

luthermonson avatar Jan 21 '21 17:01 luthermonson

Also, it would be nice to have a test added for this.

rajatchopra avatar Jan 21 '21 17:01 rajatchopra

Are you still interested in working on this PR?

thomasferrandiz avatar Sep 15 '22 15:09 thomasferrandiz

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 14 '23 17:03 stale[bot]