BGP adjacency not established when using linux-image-5.15.0-1019-aws
Describe the bug
- [ X ] Did you check if this is a duplicate issue?
- [ ] Did you test it on the latest FRRouting/frr master branch?
We have a simple eBGP setup with a peer over IPsec using a VTI interface. Everything worked fine until we upgraded to the latest kernel available for our machine: linux-image-5.15.0-1019-aws. After that the adjacency would simply not come up. I investigated for some time and found no issues with the configuration (it was not modified since it was working fine) and no other issues either. There are no firewall rules on this machine. There is also no issue with IPsec or the VTI interface: traffic flows through it, the BGP neighbor is pingable, etc. We have confirmed that downgrading to linux-image-5.15.0-1017-aws fixes the issue.
According to tcpdump it seems like both ends initiate the connection, but then there are just a lot of TCP retransmissions and duplicate ACKs after the OPEN messages. The other side sends a KEEPALIVE but FRR seems to be stuck getting it.
IMPORTANT
Since according to network traffic there is definitely a network stack related issue (the retransmissions and DUP ACKs hint at that) I'm also going to open a kernel bug with Ubuntu. However other TCP applications e.g. SSH work fine, so it is either a problem with Ubuntu with kernel linux-image-5.15.0-1019-aws + FRR or it might be with VTI interface + FRR or similar. Therefore I'm opening this bug here as well. In case it turns out to be a kernel issue it will still be good to have this as a reference/example of the issue demonstrated.
To Reproduce
- Have a machine with kernel linux-image-5.15.0-1019-aws installed
- Try to establish BGP adjacency with peer
- It remains in OpenSent state, in the background a lot of TCP retransmissions and duplicate ACKs are seen
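For the first step, a quick way to confirm which kernel a machine is actually running is sketched below. This is only an illustration: the helper name classify_kernel is made up for this example, and it assumes the running release string (as reported by uname -r or platform.release()) is the package name without the linux-image- prefix.

```python
import platform

# Kernel releases named in this report. Assumption: the running release
# string is the package name without the "linux-image-" prefix.
AFFECTED = "5.15.0-1019-aws"
KNOWN_GOOD = "5.15.0-1017-aws"


def classify_kernel(release):
    """Classify a kernel release string against the two versions in this report."""
    if release == AFFECTED:
        return "affected"
    if release == KNOWN_GOOD:
        return "known-good"
    return "unknown"


if __name__ == "__main__":
    running = platform.release()
    print(running, classify_kernel(running))
```

Anything other than the two releases mentioned here is reported as unknown, since this report only covers those two.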
Expected behavior
Things working, since the config was not changed, just the kernel version was bumped from linux-image-5.15.0-1017-aws to linux-image-5.15.0-1019-aws
Screenshots
Versions
- OS Version: Ubuntu Server 22.04.1 LTS on AWS
- Kernel: linux-image-5.15.0-1019-aws
- FRR Version: 8.1-1ubuntu1.1 (default version installed via apt)
Additional context
Log excerpt with debug bgp neighbor-events, debug bgp updates in and debug bgp updates out enabled:
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [YTARA-Q9ZD1] [Event] BGP connection from host X.X.X.X fd 25
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZKW3R-2HPPJ] [Event] New active connection from peer X.X.X.X, Killing previous active connection
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Deleted
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Active established_peers 0
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Idle to Active
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] TCP_connection_open (Active->OpenSent), fd 25
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [WECS1-Q4P17] X.X.X.X passive open
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [XKJ09-9VTZ7] X.X.X.X Sending hostname cap with hn = ip-10-0-0-10, dn = (null)
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [JFFAN-DEGED] X.X.X.X sending OPEN, version 4, my as 35030, holdtime 180, id Y.Y.Y.Y
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Active to OpenSent
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [T516Q-KWWPZ] X.X.X.X [FSM] Timer (holdtime timer expire)
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] Hold_Timer_expired (OpenSent->Idle), fd 24
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [GP3Y6-AC335] X.X.X.X [FSM] Hold timer expire
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [V1CHF-JSGRR] %NOTIFICATION: sent to neighbor X.X.X.X 4/0 (Hold Timer Expired) 0 bytes
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Idle established_peers 0
Sep 13 12:11:06 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Idle
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZQTB5-H8522] X.X.X.X [FSM] Timer (start timer expire).
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] BGP_Start (Idle->Connect), fd -1
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [Z195V-FNKRK] X.X.X.X [Event] Connect start to X.X.X.X fd 24
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [G0837-S7QES] X.X.X.X [FSM] Non blocking connect waiting result, fd 24
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Connect established_peers 0
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Idle to Connect
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] TCP_connection_open (Connect->OpenSent), fd 24
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [RWZTG-AA74G] X.X.X.X open active, local address Y.Y.Y.Y
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [XKJ09-9VTZ7] X.X.X.X Sending hostname cap with hn = ip-10-0-0-10, dn = (null)
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [JFFAN-DEGED] X.X.X.X sending OPEN, version 4, my as 35030, holdtime 180, id Y.Y.Y.Y
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep 13 12:11:07 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Connect to OpenSent
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [YTARA-Q9ZD1] [Event] BGP connection from host X.X.X.X fd 26
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZKW3R-2HPPJ] [Event] New active connection from peer X.X.X.X, Killing previous active connection
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 0
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Deleted
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Active established_peers 0
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Idle to Active
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZWCSR-M7FG9] X.X.X.X [FSM] TCP_connection_open (Active->OpenSent), fd 26
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [WECS1-Q4P17] X.X.X.X passive open
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [XKJ09-9VTZ7] X.X.X.X Sending hostname cap with hn = ip-10-0-0-10, dn = (null)
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [JFFAN-DEGED] X.X.X.X sending OPEN, version 4, my as 35030, holdtime 180, id Y.Y.Y.Y
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep 13 12:12:09 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Active to OpenSent
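To make the loop in the excerpt above easier to see, here is a minimal sketch that pulls the "went from X to Y" lines out of a bgpd log and checks whether the peer ever reached Established. The regular expression is written against the line format shown in this excerpt only, and the function names are made up for this example; other FRR versions or syslog formats may need adjustments.

```python
import re

# Matches lines such as:
# Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Active to OpenSent
TRANSITION_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]+) \S+ bgpd\[\d+\]: \[[\w-]+\] "
    r"(?P<peer>\S+) went from (?P<old>\w+) to (?P<new>\w+)$"
)


def parse_transitions(lines):
    """Return (timestamp, peer, old_state, new_state) for each transition line."""
    transitions = []
    for line in lines:
        match = TRANSITION_RE.search(line.strip())
        if match:
            transitions.append(
                (match["ts"], match["peer"], match["old"], match["new"])
            )
    return transitions


def reached_established(transitions):
    """True if any transition ended in the Established state."""
    return any(new == "Established" for _, _, _, new in transitions)


SAMPLE = [
    "Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from OpenSent to Deleted",
    "Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Idle to Active",
    "Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [WECS1-Q4P17] X.X.X.X passive open",
    "Sep 13 12:10:28 ip-10-0-0-10 bgpd[566]: [ZQHFG-DQGX1] X.X.X.X went from Active to OpenSent",
]

if __name__ == "__main__":
    for ts, peer, old, new in parse_transitions(SAMPLE):
        print(f"{ts} {peer}: {old} -> {new}")
```

Run against the excerpt above, this only ever shows the peer cycling through Idle, Active, Connect, OpenSent and Deleted; no transition ends in Established.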
Can we get a tcpdump of the exchange? I'd also like to see the output of show bgp ipv4 uni summ failed.
Sorry for the delay, I had to anonymize the PCAP I took. I made sure to keep the relationships between all IDs and IPs intact and just switched them out for different localhost ones; hopefully it worked fine. This means that if a packet in the attached PCAP comes from 127.0.0.1 with BGP ID 127.0.0.1, those two values match in real life as well.
ip-10-0-0-10# show bgp ipv4 uni summ failed
BGP router identifier Y.Y.Y.Y, local AS number ZZ vrf-id 0
BGP table version 2
RIB entries 3, using 552 bytes of memory
Peers 2, using 1446 KiB of memory
Neighbor EstdCnt DropCnt ResetTime Reason
X.X.X.X 0 0 never Waiting for peer OPEN
Displayed neighbors 1
Total number of neighbors 1
For reference here is the Ubuntu bug I opened as well: https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1989470
I wonder if this bug is related to the multiple changes introduced to the TCP stack via the patches listed here, since this is the only related thing I have been able to find so far. I shall start compiling the kernel, gradually adding patches, and see if things break :D https://ubuntu.com/security/CVE-2022-1012
This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.
This issue will be automatically closed in the specified period unless there is further activity.
Kernel 5.19.0-1023-aws was released, and the problem went away with it. Since this is the newest, supported kernel release for Ubuntu 22.04 we are OK with this.
This issue will no longer be automatically closed.