BGP Peering unstable if aggressive timers are used (1s/3s)
I have configured my Calico nodes to peer with 2 Top of Rack switches. If the BGP timers are set to 1s KEEPALIVE and 3s HOLD, the BGP sessions are reset (roughly once a day) by the Calico nodes.
Network Traces collected on the K8s Node itself show:
- the KEEPALIVE packet from the switch being received
- the BIRD process does not seem to be receiving the KEEPALIVE when the issue happens
- as a consequence BIRD resets the peering.
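For reference, the peering itself is declared with one Calico BGPPeer resource per ToR switch; the aggressive timers are configured on the switch side and picked up by BIRD during session negotiation. A minimal sketch, with illustrative peer IPs and AS number (not the exact values from this lab):

```yaml
# Dual-ToR peering sketch; peer IPs and AS number are illustrative.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-a
spec:
  peerIP: 192.168.2.201
  asNumber: 65003
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-b
spec:
  peerIP: 192.168.2.202
  asNumber: 65003
```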
In the screenshot below, collected on the Kubernetes node itself, we can see the following:
- my switch (.201) sends KEEPALIVEs every 1s to Calico (.3)
- Calico then sends a Hold Timer Expired notification, resetting the connection
When the reset happens is quite random, and enabling debug logs on BIRD is of no use, as the pod simply does not appear to receive the packet at all.
I tried peering a VM running goBGP with the same pair of switches, and in that case the connection has been rock solid for days:
```
Neighbor      V  AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
192.168.2.1   4  65003   89977    89531   21654   0    0     05:29:09   5  <--- Calico on ESXi
192.168.2.2   4  65003   90100    89642   21654   0    0     05:29:09   5  <--- Calico on ESXi
192.168.2.3   4  65003   90001    89561   21654   0    0     07:16:31   5  <--- Calico on ESXi
192.168.2.4   4  65003   89974    89524   21654   0    0     05:29:09  11  <--- Calico on ESXi
192.168.2.5   4  65003   90112    89644   21654   0    0     18:44:09  11  <--- Calico on ESXi
192.168.2.6   4  65003   90101    89641   21654   0    0     01:44:38   5  <--- Calico on ESXi
192.168.2.11  4  65003  692967   689501   21654   0    0     06:11:00   0  <--- Calico on KVM
192.168.2.12  4  65003  692971   689508   21654   0    0     12:16:25   0  <--- Calico on KVM
192.168.2.13  4  65003  692964   689507   21654   0    0     07:55:13   0  <--- Calico on KVM
192.168.2.14  4  65003  693004   689498   21654   0    0     1w1d       0  <--- goBGP on KVM
192.168.2.15  4  65003  693004   689502   21654   0    0     1w1d       0  <--- goBGP on KVM
```
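For completeness, this is roughly what the goBGP side of that comparison looks like with the same 1s/3s timers pinned; a minimal sketch of a gobgpd configuration in its YAML form, where the AS number and addresses are illustrative, not the exact lab values:

```yaml
# Sketch of a gobgpd config (openconfig-style keys); router-id,
# neighbor address, and AS number are illustrative values.
global:
  config:
    as: 65003
    router-id: 192.168.2.14
neighbors:
  - config:
      neighbor-address: 192.168.2.201
      peer-as: 65003
    timers:
      config:
        keepalive-interval: 1
        hold-time: 3
```

(Loaded with something like `gobgpd -f gobgpd.yml -t yaml`, assuming gobgpd's config-file and config-type flags.)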
Expected Behavior
BGP Connection is stable
Current Behavior
BGP Connection is periodically reset
Steps to Reproduce (for bugs)
I can recreate this in my lab at will by simply peering with 2x Top of Rack switches and configuring the timers to 1s/3s. I recreated this with bare-metal hosts and with VMs on both KVM and ESXi.
Context
I am writing an integration guide for Calico and BGP-based datacenter fabrics, and peering with 2 switches with 1s/3s timers is the bare minimum to provide high availability and reasonably fast switch/node failure detection (with a 3s HOLD timer, a dead peer is detected within roughly 3 seconds).
Your Environment
I tried multiple versions and have multiple environments:
- Calico version: several versions, including master
- Orchestrator version: Kubernetes v1.23, v1.24, v1.25
- Operating System and version: Ubuntu 20.04, 21.04, 22.04
Hey @matthewdupre, any tips on what the next step should be here?
> If the BGP timers are set to 1s/3s the BGP sessions are reset
Could you say precisely which timers you are configuring here?
I updated the original description with 1s KEEPALIVE and 3s HOLD. Hope that clarifies.
@caseydavenport - the timers are being picked up from the upstream router. We need them because the service addresses are being picked up by Calico and advertised (anycast) from the nodes that are hosting the service. The upstream router can't tell when a given node goes down (the nodes are VMs), so a chunk of service traffic is routed to a black hole for 2-3 minutes, affecting multiple users.
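The anycast advertisement described here is typically expressed through Calico's service IP advertisement in the default BGPConfiguration; a minimal sketch, with illustrative CIDRs (the exact ranges in this cluster are not given):

```yaml
# Service IP advertisement sketch; CIDRs are examples only.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceClusterIPs:
    - cidr: 10.96.0.0/12
  serviceExternalIPs:
    - cidr: 192.0.2.0/24
```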
We are now aware of an issue with BIRD that is at least 5 years old (https://bird.network.cz/pipermail/bird-users/2018-June/012461.html) and is probably related. We've seen this stable at 10s/30s; not sure exactly where it becomes stable, somewhere between 2s/6s and 10s/30s. VMs or bare metal - it is repeatable.
Would you consider a PR turning BFD on? I remember scoping it a really long time ago...
Christopher
@caseydavenport, @matthewdupre, @amit-tigera any update here?
@liljenstolpe sorry about the delay. I think we're open to turning on BFD but are still trying to figure out when we are able to make this happen. I'll keep everyone updated here as more is decided.
We'd be happy if the original request #7366 was reopened!
BFD is, as explained above, really important for failure detection in some scenarios
@beddari I think the plan is to track any BFD work against this issue. Is there some nuance that is missing from this issue that would be better covered in #7366 ?
@mgleung any verdict on having BFD config included in the BGPPeer CRD for Calico open source? My team is using MetalLB for this specific reason.
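To make the ask concrete: MetalLB models this as a BFDProfile resource referenced from its BGPPeer. A minimal sketch with illustrative values; Calico's BGPPeer has no equivalent field today, which is exactly the gap being discussed:

```yaml
# MetalLB's BFD support, for comparison; all values are illustrative.
apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: fast-failover
  namespace: metallb-system
spec:
  receiveInterval: 300   # milliseconds
  transmitInterval: 300  # milliseconds
  detectMultiplier: 3
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-a
  namespace: metallb-system
spec:
  myASN: 65003
  peerASN: 65003
  peerAddress: 192.168.2.201
  bfdProfile: fast-failover
```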
Sorry @RefluxMeds , I'm currently out of the loop. @caseydavenport might know more.
Hi @caseydavenport, is there any work being done on this (https://github.com/projectcalico/calico/issues/7086) or this (https://github.com/projectcalico/calico/issues/4607) issue? My team is really interested in this feature! Are there any blockers? Lack of time?
@RefluxMeds unfortunately the Calico team isn't working on this actively right now, but I would be happy to see it. The blocker is just time within the team to pick it up.
Hi @caseydavenport, so the feature is not even on the roadmap, i.e. there are no plans to pick it up at some point? Thanks!
My last comment unfortunately still stands. Would be happy to review any PRs though.