
BGP adjacency not established when using linux-image-5.15.0-1019-aws

Open domi007 opened this issue 2 years ago • 2 comments


Describe the bug

  • [ X ] Did you check if this is a duplicate issue?
  • [ ] Did you test it on the latest FRRouting/frr master branch?

We have a simple eBGP setup with a peer over IPsec using a VTI interface. Everything worked fine until we upgraded to the latest kernel available for our machine, linux-image-5.15.0-1019-aws. After that, the adjacency would not come up. I investigated for some time and found no problem with the configuration (it had not been modified since it was last working) and nothing else wrong either. There are no firewall rules on this machine. The IPsec tunnel and the VTI interface are fine as well: traffic flows through the tunnel, the BGP neighbor is pingable, and so on. We have confirmed that downgrading to linux-image-5.15.0-1017-aws fixes the issue.

According to tcpdump, both ends initiate the connection, but after the OPEN messages there are only a lot of TCP retransmissions and duplicate ACKs. The other side sends a KEEPALIVE, but FRR seems to get stuck receiving it.

IMPORTANT: Since the network traffic clearly points to a network-stack issue (the retransmissions and duplicate ACKs hint at that), I'm also going to open a kernel bug with Ubuntu. However, other TCP applications such as SSH work fine, so the problem is either with Ubuntu kernel linux-image-5.15.0-1019-aws + FRR, or with the VTI interface + FRR, or something similar. Therefore I'm opening this bug here as well. Even if it turns out to be a kernel issue, it will still be good to have this as a reference example of the problem.

To Reproduce

  1. Have a machine with kernel linux-image-5.15.0-1019-aws installed
  2. Try to establish BGP adjacency with peer
  3. The session remains in the OpenSent state, while a lot of TCP retransmissions and duplicate ACKs are seen in the background

Expected behavior

The BGP adjacency comes up as before, since the config was not changed; only the kernel version was bumped from linux-image-5.15.0-1017-aws to linux-image-5.15.0-1019-aws.

Screenshots

Versions

  • OS Version: Ubuntu Server 22.04.1 LTS on AWS
  • Kernel: linux-image-5.15.0-1019-aws
  • FRR Version: 8.1-1ubuntu1.1 (default version installed via apt)

Additional context

Log excerpt with debug bgp neighbor-events, debug bgp updates in, and debug bgp updates out enabled:

Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [YTARA-Q9ZD1] [Event] BGP connection from host X.X.X.X fd 25
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZKW3R-2HPPJ] [Event] New active connection from peer X.X.X.X, Killing previous active connection
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Deleted
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Active established_peers 0
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Idle to Active
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] TCP_connection_open (Active->OpenSent), fd 25
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [WECS1-Q4P17] X.X.X.X passive open
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [XKJ09-9VTZ7] X.X.X.X Sending hostname cap with hn = ip-10-0-0-10, dn = (null)
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [JFFAN-DEGED] X.X.X.X sending OPEN, version 4, my as 35030, holdtime 180, id Y.Y.Y.Y
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Active to OpenSent
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [T516Q-KWWPZ] X.X.X.X [FSM] Timer (holdtime timer expire)
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] Hold_Timer_expired (OpenSent->Idle), fd 24
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [GP3Y6-AC335] X.X.X.X [FSM] Hold timer expire
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [V1CHF-JSGRR] %NOTIFICATION: sent to neighbor X.X.X.X 4/0 (Hold Timer Expired) 0 bytes
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Idle established_peers 0
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Idle
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZQTB5-H8522] X.X.X.X [FSM] Timer (start timer expire).
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] BGP_Start (Idle->Connect), fd -1
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [Z195V-FNKRK] X.X.X.X [Event] Connect start to X.X.X.X fd 24
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [G0837-S7QES] X.X.X.X [FSM] Non blocking connect waiting result, fd 24
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Connect established_peers 0
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Idle to Connect
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] TCP_connection_open (Connect->OpenSent), fd 24
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [RWZTG-AA74G] X.X.X.X open active, local address Y.Y.Y.Y
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [XKJ09-9VTZ7] X.X.X.X Sending hostname cap with hn = ip-10-0-0-10, dn = (null)
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [JFFAN-DEGED] X.X.X.X sending OPEN, version 4, my as 35030, holdtime 180, id Y.Y.Y.Y
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Connect to OpenSent
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [YTARA-Q9ZD1] [Event] BGP connection from host X.X.X.X fd 26
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZKW3R-2HPPJ] [Event] New active connection from peer X.X.X.X, Killing previous active connection
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Deleted
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Active established_peers 0
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Idle to Active
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] TCP_connection_open (Active->OpenSent), fd 26
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [WECS1-Q4P17] X.X.X.X passive open
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [XKJ09-9VTZ7] X.X.X.X Sending hostname cap with hn = ip-10-0-0-10, dn = (null)
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [JFFAN-DEGED] X.X.X.X sending OPEN, version 4, my as 35030, holdtime 180, id Y.Y.Y.Y
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Active to OpenSent
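
The loop visible in the excerpt above (OpenSent, then Deleted or Idle, then OpenSent again, never Established) can be pulled out of a longer log mechanically. The following is a minimal sketch, assuming the syslog line layout shown above; the names `TRANSITION_RE`, `fsm_transitions`, and `reached_established` are hypothetical helpers, not part of FRR.

```python
import re

# Matches bgpd lines of the form seen in the excerpt, e.g.
#   "Sep 13 12:10:28 host bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Deleted"
# Assumption: the syslog prefix and message layout match the excerpt above.
TRANSITION_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]+)\s\S+\sbgpd\[\d+\]:\s"
    r"\[[A-Z0-9-]+\]\s(?P<peer>\S+) went from (?P<old>\w+) to (?P<new>\w+)$"
)


def fsm_transitions(lines):
    """Return (timestamp, peer, old_state, new_state) for each state change."""
    found = []
    for line in lines:
        match = TRANSITION_RE.match(line.strip())
        if match:
            found.append(
                (match["ts"], match["peer"], match["old"], match["new"])
            )
    return found


def reached_established(transitions):
    """True if any recorded transition ends in the Established state."""
    return any(new == "Established" for _, _, _, new in transitions)
```

Feeding the excerpt to `fsm_transitions` and then calling `reached_established` on the result gives a quick yes/no answer on whether the session ever came up during the captured window.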

domi007 avatar Sep 13 '22 12:09 domi007

Can we get a tcpdump of the exchange? I'd also like to see the output of show bgp ipv4 uni summ failed

donaldsharp avatar Sep 13 '22 15:09 donaldsharp

Sorry for the delay, I had to anonymize the PCAP I took. I made sure to keep the relationships between all IDs and IPs intact and only replaced them with different localhost addresses; hopefully it worked fine. This means that if a packet in the attached PCAP comes from 127.0.0.1 with BGP ID 127.0.0.1, the real-life values matched each other as well.

ip-10-0-0-10# show bgp ipv4 uni summ failed
BGP router identifier Y.Y.Y.Y, local AS number ZZ vrf-id 0
BGP table version 2
RIB entries 3, using 552 bytes of memory
Peers 2, using 1446 KiB of memory

Neighbor        EstdCnt DropCnt ResetTime Reason
X.X.X.X               0       0     never Waiting for peer OPEN

Displayed neighbors 1
Total number of neighbors 1

BGP_adjacency_down_test01_ANON.zip

domi007 avatar Sep 15 '22 11:09 domi007

For reference here is the Ubuntu bug I opened as well: https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1989470

domi007 avatar Sep 23 '22 08:09 domi007

I wonder if this bug is related to the multiple changes introduced to the TCP stack by the patches listed here, since this is the only related thing I have been able to find so far... I shall start compiling the kernel, gradually adding patches, and see if things break :D https://ubuntu.com/security/CVE-2022-1012

domi007 avatar Oct 06 '22 09:10 domi007

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

github-actions[bot] avatar Apr 05 '23 01:04 github-actions[bot]

This issue will be automatically closed in the specified period unless there is further activity.

frrbot[bot] avatar Apr 05 '23 01:04 frrbot[bot]

Kernel 5.19.0-1023-aws was released, and the problem went away with it. Since this is the newest supported kernel release for Ubuntu 22.04, we are OK with this.
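
The kernel data points reported in this thread can be kept in a small lookup for a pre-flight check on a router host. This is only a sketch: it assumes the running kernel's release string (as returned by `platform.release()`) is the package name without the `linux-image-` prefix, and `KNOWN_KERNELS` and `kernel_status` are hypothetical names. Kernels not mentioned in the thread are reported as unknown rather than guessed.

```python
import platform

# Results as reported in this thread only; other releases are untested.
KNOWN_KERNELS = {
    "5.15.0-1017-aws": "works",
    "5.15.0-1019-aws": "broken",
    "5.19.0-1023-aws": "works",
}


def kernel_status(release=None):
    """Classify a kernel release string; defaults to the running kernel."""
    if release is None:
        release = platform.release()
    return KNOWN_KERNELS.get(release, "unknown")
```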

domi007 avatar Apr 05 '23 07:04 domi007

This issue will no longer be automatically closed.

frrbot[bot] avatar Apr 05 '23 07:04 frrbot[bot]