FRR daemons restart after 17 hours of soak test with 1 million BGP RIB routes
- FRR VERSION 7.5.1
vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg# show version
FRRouting 7.5.1 (vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--libdir=/usr/lib64/frr' '--libexecdir=/usr/lib64/frr' '--sysconfdir=/etc/frr' '--localstatedir=/run/frr' '--with-moduledir=/usr/lib64/frr/modules' '--enable-multipath=64' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-fpm' '--enable-protobuf' '--enable-ospfapi' '--disable-exampledir' '--disable-doc' '--disable-snmp' '--disable-ldpd' '--disable-ripd' '--disable-ripngd' '--disable-eigrpd' '--disable-babeld' '--disable-isisd' '--disable-pimd' '--disable-nhrpd' '--disable-pbrd'
- OPERATING SYSTEM VERSION
[root@vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg /]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.5 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.5 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.5
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.5"
- KERNEL VERSION
[root@vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg /]# cat /proc/version
Linux version 4.18.0-305.45.1.el8_4.x86_64 ([email protected]) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Wed Apr 6 13:48:37 EDT 2022
Beginning of tests - outputs of ps -aux, show memory, and show bgp summary
vrouter-65535-ffff0abacb991501-76c588c8f-tmtgw# show running-config
Building configuration...
Current configuration:
!
frr version 7.5.1
frr defaults traditional
hostname vrouter-65535-ffff0abacb991501-76c588c8f-tmtgw
log file /var/log/frr/frr.log
log timestamp precision 3
!
password frrpass
enable password frrpass
!
interface net3e0b4170619c
ip address 172.16.4.2/24
ipv6 address 2004::2/64
!
interface net91e522342cb3
ip address 172.16.3.2/24
ipv6 address 2003::2/64
!
router bgp 55555
bgp router-id 172.16.5.1
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
no bgp network import-check
neighbor 172.16.3.1 remote-as 11111
neighbor 172.16.3.1 password 123key
neighbor 172.16.3.1 solo
neighbor 172.16.3.1 disable-connected-check
neighbor 172.16.3.1 update-source 172.16.3.2
neighbor 172.16.3.1 timers 30 90
neighbor 172.16.3.1 timers connect 30
neighbor 172.16.4.1 remote-as 11111
neighbor 172.16.4.1 password 123key
neighbor 172.16.4.1 solo
neighbor 172.16.4.1 disable-connected-check
neighbor 172.16.4.1 update-source 172.16.4.2
neighbor 172.16.4.1 timers 30 90
neighbor 172.16.4.1 timers connect 30
neighbor 2003::1 remote-as 11111
neighbor 2003::1 password 123key
neighbor 2003::1 solo
neighbor 2003::1 disable-connected-check
neighbor 2003::1 update-source 2003::2
neighbor 2003::1 timers 30 90
neighbor 2003::1 timers connect 30
neighbor 2004::1 remote-as 11111
neighbor 2004::1 password 123key
neighbor 2004::1 solo
neighbor 2004::1 disable-connected-check
neighbor 2004::1 update-source 2004::2
neighbor 2004::1 timers 30 90
neighbor 2004::1 timers connect 30
!
address-family ipv4 unicast
neighbor 172.16.3.1 activate
neighbor 172.16.3.1 route-map v4-import-gobgp in
neighbor 172.16.4.1 activate
neighbor 172.16.4.1 route-map v4-import-gobgp in
exit-address-family
!
address-family ipv6 unicast
neighbor 2003::1 activate
neighbor 2004::1 activate
exit-address-family
!
bgp as-path access-list import-v4-aspath permit 11111
!
route-map v4-match-aspath permit 10
match as-path import-v4-aspath
set local-preference 120
!
route-map v4-import-gobgp permit 10
call v4-match-aspath
!
line vty
!
end
vrouter-65535-ffff0abacb991501-76c588c8f-tmtgw#
vrouter-65535-ffff0abacb991501-76c588c8f-tmtgw# show bgp summary
IPv4 Unicast Summary:
BGP router identifier 172.16.5.1, local AS number 55555 vrf-id 0
BGP table version 55125035
RIB entries 1466835, using 269 MiB of memory
Peers 2, using 43 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
172.16.3.1 4 11111 2971424 2522783 0 0 0 00:00:34 802219 758663
172.16.4.1 4 11111 3449864 2961091 0 0 0 00:00:45 802219 43557
Total number of neighbors 2
IPv6 Unicast Summary:
BGP router identifier 172.16.5.1, local AS number 55555 vrf-id 0
BGP table version 6223651
RIB entries 163079, using 30 MiB of memory
Peers 2, using 43 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
2003::1 4 11111 1633403 566468 0 0 0 00:00:46 85715 310
2004::1 4 11111 1653465 1093286 0 0 0 00:00:40 85715 85406
Total number of neighbors 2
vrouter-65535-ffff0abacb991501-76c588c8f-tmtgw# show bgp summary
IPv4 Unicast Summary:
BGP router identifier 172.16.5.1, local AS number 55555 vrf-id 0
BGP table version 55125035
RIB entries 1466835, using 269 MiB of memory
Peers 2, using 43 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
172.16.3.1 4 11111 2971424 2522783 0 0 0 00:00:35 802219 758663
172.16.4.1 4 11111 3449864 2961091 0 0 0 00:00:46 802219 43557
Total number of neighbors 2
IPv6 Unicast Summary:
BGP router identifier 172.16.5.1, local AS number 55555 vrf-id 0
BGP table version 6223651
RIB entries 163079, using 30 MiB of memory
Peers 2, using 43 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
2003::1 4 11111 1633403 566468 0 0 0 00:00:47 85715 310
2004::1 4 11111 1653465 1093286 0 0 0 00:00:41 85715 85406
Total number of neighbors 2
2022/07/10 10:27:53.980 BGP: Terminating on signal
2022/07/10 10:27:57.194 ZEBRA: connection closed socket [32]
2022/07/10 10:27:57.194 ZEBRA: [EC 4043309122] Client 'vnc' encountered an error and is shutting down.
2022/07/10 10:27:57.194 ZEBRA: connection closed socket [29]
2022/07/10 10:27:57.194 ZEBRA: [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
2022/07/10 10:27:57.235 OSPF: Terminating on signal
2022/07/10 10:27:57.252 ZEBRA: connection closed socket [39]
2022/07/10 10:27:57.252 ZEBRA: [EC 4043309122] Client 'ospf' encountered an error and is shutting down.
2022/07/10 10:27:57.371 STATIC: Terminating on signal
2022/07/10 10:27:57.387 ZEBRA: connection closed socket [13]
2022/07/10 10:27:57.387 ZEBRA: [EC 4043309122] Client 'static' encountered an error and is shutting down.
2022/07/10 10:27:57.507 ZEBRA: connection closed socket [44]
2022/07/10 10:27:57.507 ZEBRA: [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
last few seconds before crash
39593243:2022/07/10 10:08:53.946 ZEBRA: [EC 100663313] SLOW THREAD: task rib_process_dplane_results (46a040) ran for 9645ms (cpu time 9574ms)
39593294:2022/07/10 10:14:19.224 ZEBRA: [EC 100663313] SLOW THREAD: task rib_process_dplane_results (46a040) ran for 60455ms (cpu time 59964ms)
39593305:2022/07/10 10:15:55.564 ZEBRA: [EC 100663313] SLOW THREAD: task rib_process_dplane_results (46a040) ran for 69619ms (cpu time 68979ms)
Describe the bug
1. Injecting ~1.7M IPv4 and IPv6 routes from gobgp into the router under test; once the RIB is populated with all the routes, the neighbors are disabled using the following commands:
vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg# configure
vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg(config)# router bgp 55555
vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg(config-router)# neighbor 172.16.3.1 shutdown
vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg(config-router)# neighbor 172.16.4.1 shutdown
vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg(config-router)# neighbor 2003::1 shutdown
vrouter-65535-ffff6fa412b34dc0-846655598c-hnwfg(config-router)# neighbor 2004::1 shutdown
2. The neighbors are then enabled again using "no neighbor 172.16.3.1 shutdown" (and similarly for all neighbors). This disable/enable cycle continues for some time, and we expect it to run seamlessly for 72 hours. But after ~17 hours, all the FRR processes restart, and we could not get much information from the FRR logs.
CPU usage was high for the zebra and bgpd processes, and the memory of both was steadily increasing.
After 13 hours of tests - outputs
[root@vrouter-65535-ffff0abacb991501-76c588c8f-tmtgw /]# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 19112 3268 ? Ss Jul09 0:00 /bin/bash /entrypoint.sh
frr 31 122 1.5 1604592 758108 ? Rsl Jul09 1099:19 /usr/sbin/zebra -d -A 127.0.0.1 -M fpm:protobuf --nl-bufsize 26214400
frr 40 0.0 0.0 92832 4900 ? Ss Jul09 0:05 /usr/sbin/staticd -d -A 127.0.0.1
frr 49 68.3 2.1 1358636 1050316 ? Ssl Jul09 611:12 /usr/sbin/bgpd -d -A 127.0.0.1
frr 62 0.0 0.0 94300 6324 ? Ss Jul09 0:06 /usr/sbin/ospfd -d -a -A 127.0.0.1
frr 76 0.0 0.0 92984 5176 ? Ss Jul09 0:20 /usr/sbin/bfdd -d -A 127.0.0.1
root 86 0.0 0.0 75348 3252 ? Ss Jul09 0:19 /usr/sbin/watchfrr -d -r /usr/sbin/frr restart %s -s /usr/sbin/frr start %s -k /usr/sbin/frr stop %s zebra bgpd ospfd staticd bfdd
root 99 78.7 0.0 2649276 31712 ? Sl Jul09 703:54 /opt/kaloom/bin/vrouter-agent
root 3926 0.0 0.0 19368 3700 pts/0 Ss+ Jul09 0:00 /bin/bash
root 4055 0.0 0.0 19368 3728 pts/1 Ss+ Jul09 0:00 /bin/bash
root 5211 0.0 0.0 19368 3776 pts/2 Ss+ Jul09 0:00 /bin/bash
root 7233 0.0 0.0 19368 3788 pts/3 Ss+ Jul09 0:00 /bin/bash
root 15883 0.3 0.0 19368 3744 pts/4 Ss 05:49 0:00 /bin/bash
root 15897 0.0 0.0 36724 2508 ? S 05:49 0:00 ping -i 0.2 -c 10 -I net91e522342cb3 fe80::3efd:feff:feb8:5f34
root 15898 0.0 0.0 54792 3968 pts/4 R+ 05:49 0:00 ps -aux
[root@vrouter-65535-ffff0abacb991501-76c588c8f-tmtgw /]# vtysh
- [ ] Did you check if this is a duplicate issue? It is not a duplicate of any existing issue that I saw in the GitHub FRR issues.
- [ ] Did you test it on the latest FRRouting/frr master branch? No, I did not test on the latest FRR.
To Reproduce
1. Populate the RIB with 1.7M BGP routes, as shown in "show bgp summary":
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
172.16.3.1 4 11111 2164217 1754263 0 0 0 00:00:50 802219 269532
172.16.4.1 4 11111 2693483 2186672 0 0 0 00:00:48 802219 532688
Total number of neighbors 2
IPv6 Unicast Summary:
BGP router identifier 172.16.5.1, local AS number 55555 vrf-id 0
BGP table version 4644179
RIB entries 163079, using 30 MiB of memory
Peers 2, using 43 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
2003::1 4 11111 1204560 429548 0 0 0 00:00:30 85715 85660
2004::1 4 11111 1223820 800008 0 0 0 00:00:41 85715 56
2. Disable the peers using the "neighbor shutdown" command and check that all RIB routes are flushed out; then enable the peers again and check that all RIB routes are repopulated in the table.
3. Continue steps 1 and 2 for 72 hours.
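The disable/enable cycle in the steps above can be sketched as a small script (hypothetical; it only prints the vtysh invocations, so it can be inspected or dry-run off-router and then piped to sh on the device under test; neighbor addresses and timings are taken from this report):

```shell
# One soak cycle: shut down all four peers, wait for the RIB flush, then
# re-enable them and wait for repopulation. Timings per the report.
NEIGHBORS="172.16.3.1 172.16.4.1 2003::1 2004::1"

cycle() {
    for n in $NEIGHBORS; do
        echo "vtysh -c 'conf t' -c 'router bgp 55555' -c 'neighbor $n shutdown'"
    done
    echo "sleep 20   # RIB flush takes ~15-20 s"
    for n in $NEIGHBORS; do
        echo "vtysh -c 'conf t' -c 'router bgp 55555' -c 'no neighbor $n shutdown'"
    done
    echo "sleep 50   # ~1.7M routes repopulate in ~45-50 s"
}

cycle   # on the router: e.g. run 'while :; do cycle | sh; done' for 72 hours
```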
Expected behavior
The above test should continue without any fault.
Actual behaviour observed:
All FRR daemons restarted after 17 hours of this test. Sometimes they restart in 6 hours, sometimes in 10, and sometimes in 17. We ran the test three times to observe the behaviour.
No clue from the FRR logs except what is attached at the beginning.
Can you please help debug this further to find the root cause?
Perhaps you could fill out the information requested when you filed the issue?
Yes, due to formatting, the information was not available in the beginning. I have attached the complete information now. Can someone please help me with this?
@rera1712 do you see something like a crash (backtrace) in logs or not? Because it's not clear if this is a crash or stopped somehow.
Btw, could you try with the 8.2 or master?
No, the daemons did not crash; they all stopped unexpectedly. These were the last few lines of the FRR log:
2022/07/10 10:27:53.980 BGP: Terminating on signal
2022/07/10 10:27:57.194 ZEBRA: connection closed socket [32]
2022/07/10 10:27:57.194 ZEBRA: [EC 4043309122] Client 'vnc' encountered an error and is shutting down.
2022/07/10 10:27:57.194 ZEBRA: connection closed socket [29]
2022/07/10 10:27:57.194 ZEBRA: [EC 4043309122] Client 'bgp' encountered an error and is shutting down.
2022/07/10 10:27:57.235 OSPF: Terminating on signal
2022/07/10 10:27:57.252 ZEBRA: connection closed socket [39]
2022/07/10 10:27:57.252 ZEBRA: [EC 4043309122] Client 'ospf' encountered an error and is shutting down.
2022/07/10 10:27:57.371 STATIC: Terminating on signal
2022/07/10 10:27:57.387 ZEBRA: connection closed socket [13]
2022/07/10 10:27:57.387 ZEBRA: [EC 4043309122] Client 'static' encountered an error and is shutting down.
2022/07/10 10:27:57.507 ZEBRA: connection closed socket [44]
2022/07/10 10:27:57.507 ZEBRA: [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
Even watchfrr was restarted. Is it possible that watchfrr is triggering this? How can I get the watchfrr logs?
I also saw these debug messages in the FRR log before the crash:
39593243:2022/07/10 10:08:53.946 ZEBRA: [EC 100663313] SLOW THREAD: task rib_process_dplane_results (46a040) ran for 9645ms (cpu time 9574ms)
39593294:2022/07/10 10:14:19.224 ZEBRA: [EC 100663313] SLOW THREAD: task rib_process_dplane_results (46a040) ran for 60455ms (cpu time 59964ms)
39593305:2022/07/10 10:15:55.564 ZEBRA: [EC 100663313] SLOW THREAD: task rib_process_dplane_results (46a040) ran for 69619ms (cpu time 68979ms)
Are the above indicating anything? Is there a possibility that some thread is stuck and not getting CPU cycles? I saw CPU usage steadily increasing for the zebra and bgpd processes.
Sure, I will also try with 8.2.
Are you running FRR through systemd? If so, can you check journalctl -u frr and see if there are any logs indicating the service was restarted due to systemd getting involved?
According to the output, it's not running in systemd. It's a docker -> watchfrr. Could you show your /etc/frr/daemons? It might be reasonable to increase watchfrr timeout:
# man frr-watchfrr | grep -A1 -- "-T"
-T <number>, --restart-timeout <number>
Set the restart (kill) timeout in seconds (the default value is "20"). If any background jobs are still running after this period has elapsed, they will be killed.
[root@vrouter-65535-ffff92d2767e0465-5fb88b9df4-qkt2g new-mem1]# cat /etc/frr/daemons
watchfrr_enable=yes
watchfrr_options="-l 8 -r '/usr/sbin/frr restart %s' -s '/usr/sbin/frr start %s' -k '/usr/sbin/frr stop %s'"
zebra=yes
bgpd=yes
ospfd=yes
ospf6d=no
ripd=no
ripngd=no
isisd=no
ldpd=no
pimd=no
nhrpd=no
eigrpd=no
babeld=no
sharpd=no
pbrd=no
staticd=yes
bfdd=yes
fabricd=no
zebra_options=("-A 127.0.0.1 -M fpm:protobuf --nl-bufsize 26214400")
#zebra_options=("-A 127.0.0.1")
bgpd_options=("-A 127.0.0.1")
ospfd_options=("-a -A 127.0.0.1")
ospf6d_options=("-A ::1")
ripd_options=("-A 127.0.0.1")
ripngd_options=("-A ::1")
isisd_options=("-A 127.0.0.1")
ldpd_options=("-A 127.0.0.1")
pimd_options=("-A 127.0.0.1")
nhrpd_options=("-A 127.0.0.1")
eigrpd_options=("-A 127.0.0.1")
babeld_options=("-A 127.0.0.1")
sharpd_options=("-A 127.0.0.1")
pbrd_options=("-A 127.0.0.1")
staticd_options=("-A 127.0.0.1")
bfdd_options=("-A 127.0.0.1")
fabricd_options=("-A 127.0.0.1")
vtysh_enable=yes
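Concretely, applying the timeout suggestion above would mean extending watchfrr_options in this file (a sketch, not a verified fix; the -t and -T flags are documented in frr-watchfrr(8), and the 180-second values are illustrative):

```shell
# /etc/frr/daemons (fragment) -- raise watchfrr's unresponsiveness timeout
# (-t) and restart/kill timeout (-T) from their 10 s / 20 s defaults.
watchfrr_enable=yes
watchfrr_options="-l 8 -t 180 -T 180 -r '/usr/sbin/frr restart %s' -s '/usr/sbin/frr start %s' -k '/usr/sbin/frr stop %s'"
```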
How would increasing [--restart-timeout] help in this case? How about increasing the timeout for the -t option ("Set the unresponsiveness timeout in seconds (the default value is \"10\")."), in case a process is busy processing, CPU is high, and it is not responding to the watchdog?
Also, I have enabled logging for watchfrr using -l.
https://www.mankier.com/8/frr-watchfrr
Is there any way to get the watchfrr logs?
I was saying that watchfrr might kill (try to restart) the daemons if they do not respond within 10 seconds or so. You can always check the output of show watchfrr as well.
I am also thinking of excluding the bgpd and zebra daemons from watchfrr using the command "watchfrr ignore <DAEMON>", so that they will not be restarted even if they become unresponsive.
Is it possible to get the watchfrr logs somehow?
And one more issue I see: after watchfrr restarts all the daemons, vtysh does not show the configuration. All configuration is gone except "router bgp 55555", but /etc/frr/frr.conf still has all the configuration. Is this a known issue?
Increasing the watchfrr timeout for restarting the daemons did not help.
watchfrr was running with the following options:
/usr/sbin/watchfrr -d -i 10 -t 180 -T 180 -l 8 -r /usr/sbin/frr restart %s -s /usr/sbin/frr start %s -k /usr/sbin/frr stop %s zebra bgpd ospfd staticd bfdd
watchfrr received a SIGINT and restarted all daemons after 10 hours. The log showed "Terminating on signal", "[pid 78943]: /usr/sbin/frr restart all", and "...ng sent 180 seconds ago".
But "show watchfrr" never showed anything other than Up for any of the daemons it was watching before restarting them. I am not sure whether it simply could not report correctly during that time, or whether watchfrr could have received the SIGINT for some other reason.
Are you perchance running NetworkManager? If so please turn it off
I do not see NetworkManager installed in the system, so it is not running.
Latest observation: I enabled the watchfrr command to ignore the daemons it is monitoring (zebra, bgpd, bfdd, ospfd, staticd), and I did not see watchfrr restarting the daemons after 72 hours of soak testing with continuous peer disable/enable.
Could it be the case that the daemons were not responding for more than 3 minutes (180 seconds), since increasing the watchfrr restart timeout up to 180 seconds did not help, but removing the daemons from the watchlist worked?
Something is going wrong where FRR is spending a very, very long time processing results (as evidenced by CPU run times of 70 seconds). Can you get a perf recording and generate an SVG graph of zebra when things are going bad? https://github.com/FRRouting/frr/wiki/Perf-Recording
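Following the wiki page linked above, a rough capture sequence might look like this (a sketch; the sampling rate and duration are illustrative, and the stackcollapse-perf.pl/flamegraph.pl script paths are assumptions, cloned from Brendan Gregg's FlameGraph repository):

```shell
# Sample zebra's call stacks while the CPU spike is in progress, then render
# an SVG flame graph; run as root on the router under test.
ZPID=$(pidof zebra || true)
if [ -n "$ZPID" ]; then
    perf record -F 99 -g -p "$ZPID" -- sleep 60   # 60 s of stack samples
    perf script > zebra.perf
    # stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repo
    ./stackcollapse-perf.pl zebra.perf | ./flamegraph.pl > zebra.svg
else
    echo "zebra is not running; start it (and the soak test) first" >&2
fi
```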
Sure, I will go through the suggested link and provide you the results. One more observation: when I add some sleep/wait between the peer disable/enable, I do not see the CPU going high, and I was able to run the soak for 72 hours.
In the test we wait for the RIBs to get populated, which takes 45 to 50 seconds, then immediately disable the neighbors; we then wait for the RIBs to be flushed out, which takes around 15 to 20 seconds, and immediately re-enable the neighbors, continuing this for hours.
Does zebra do any cleanup of internal structures after every RIB update, or is there anything to be taken care of when injecting >1M routes? Do we have to give zebra enough time for cleanup after every RIB learn and RIB flush with >1M routes?
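One way to give zebra settling time without a fixed sleep is to poll until the RIB entry count stops changing between samples (a hypothetical helper, not FRR functionality; it parses the "RIB entries" lines of the "show bgp summary" output shown earlier in this report):

```shell
# Hypothetical convergence check: sum the IPv4+IPv6 "RIB entries" counters
# from 'show bgp summary' and wait until two consecutive samples match.
rib_count() {
    vtysh -c 'show bgp summary' \
        | awk '/RIB entries/ {sum += $3} END {print sum + 0}'
}

wait_rib_stable() {
    prev=-1
    while :; do
        cur=$(rib_count)
        [ "$cur" = "$prev" ] && break   # unchanged since last sample: done
        prev=$cur
        sleep 5
    done
}
```

Calling wait_rib_stable between each "neighbor shutdown" / "no neighbor shutdown" step would replace the fixed 45-50 s and 15-20 s waits described above.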
It would be easier if we can see a perf recording here.
Sorry, I do not have a lab setup for this at the moment, as it was taken for other priority tasks. I will get back to you on this once I have the setup with the scenarios configured.
This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.
This issue will be automatically closed in the specified period unless there is further activity.