frr
frr copied to clipboard
bgpd from version 8.4 to 9.1 (current stable) memory leak when using default-originate route map (set large-community / set community)
bgpd# show version FRRouting 9.0.1 (gazgw) on Linux(6.1.0-12-amd64). Copyright 1996-2005 Kunihiro Ishiguro, et al. configured with: '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--localstatedir=/var/run/frr' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--enable-rpki' '--disable-scripting' '--enable-pim6d' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-bgp-vnc' '--enable-multipath=256' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'
system top reports:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15022 frr 20 0 6820200 6.4g 5640 S 0.3 20.4 1:31.66 bgpd
1668 frr 20 0 636332 186880 2040 S 0.0 0.6 59:08.28 zebra
bgpd -> show memory
System allocator statistics: Total heap allocated: > 2GB Holding block headers: 10252 KiB Used small blocks: 0 bytes Used ordinary blocks: > 2GB Free small blocks: 3712 bytes Free ordinary blocks: 5282 KiB Ordinary blocks: 1404 Small blocks: 74 Holding blocks: 3 (see system documentation for 'mallinfo' for meaning) --- qmem libfrr --- Type : Current# Size Total Max# MaxBytes Buffer : 6 24 144 6 144 Buffer data : 1 variable 4120 8 32960 Host config : 8 variable 304 8 304 Completion item : 0 variable 0 1 24 Command Tokens : 12944 72 932672 12966 934272 Command Token Text : 9273 variable 323864 9294 330096 Command Token Help : 9273 variable 223544 9294 224048 Command Argument : 2 variable 48 24 576 Command Argument Name : 2248 variable 53984 2269 54488 Command Match Stack : 0 variable 0 125 3000 Lexer token (temporary) : 0 variable 0 2 64 Access List : 1 56 56 1 56 Access List Str : 1 14 24 1 24 Access Filter : 2 112 240 2 240 RCU thread : 2 128 272 2 272 FRR POSIX Thread : 4 variable 320 4 320 POSIX sync primitives : 4 variable 192 4 192 Graph : 41 8 984 42 1008 Graph Node : 15224 32 611040 15226 611120 Hash : 703 variable 34376 703 34376 Hash Bucket : 355646 32 14231408 360167 14412216 Hash Index : 352 variable 5645048 353 7746280 Interface : 14 272 3920 14 3920 Connected : 25 48 1416 25 1416 Link List : 159 40 6408 163 6576 Link Node : 680 24 16512 807 19608 Temporary memory : 217 variable 11656 229 12168 Bitfield memory : 2 variable 10256 2 10256 Nexthop : 237 152 36200 237 36200 Northbound Node : 258 1192 307536 258 307536 Northbound Configuration : 2 24 48 2 48 Northbound Configuration Entry: 372 1032 384144 372 384144 Prefix List : 10 88 880 10 880 Prefix List Str : 10 variable 240 10 240 Prefix List Entry : 44 136 6048 44 6048 Prefix List Trie Table : 28 4096 114912 28 114912 Prefix : 25 56 1480 25 1480 Privilege information : 3 variable 136 3 136 Ring buffer : 62 variable 3048992 62 3048992 Route map : 22 120 2640 22 2640 Route map name : 171 variable 4104 172 4128 Route map index : 92 152 14464 92 14464 Route map rule : 201 40 8088 201 8088 Route map rule str : 192 variable 4944 192 4944 Route map compiled : 267 variable 6408 267 6408 Route map dependency : 42 24 1008 42 1008 Route map dependency data : 75 16 1800 75 1800 Skip List : 2 56 112 2 112 Skip Node : 2 160 336 2 336 Skiplist Counters : 2 68 144 2 144 Socket union : 44 112 5280 44 5280 Stream : 49 variable 2602472 33800 7476528 Stream FIFO : 62 64 4464 62 4464 Route table : 144 56 8128 144 8128 Route node : 56 120 6880 56 6880 Thread : 81 160 13640 110 18544 Thread master : 12 variable 50352 12 50352 Thread Poll Info : 6 8192 49200 6 49200 Thread stats : 29 96 3032 29 3032 Typed-hash bucket : 53 variable 6304328 53 6304328 Typed-heap array : 1 576 584 1 584 Vector : 30532 24 733952 30537 734072 Vector index : 30532 variable 963376 30537 963880 VRF : 1 216 216 1 216 VRF bit-map : 3 8 72 3 72 VTY : 5 variable 61256 5 61256 VTY server : 2 32 96 2 96 VTY output buffer : 0 1038 0 1 1048 VTY history : 9 variable 216 9 216 Work queue : 3 144 456 3 456 Work queue item : 0 24 0 7 168 Work queue name string : 3 variable 72 3 72 YANG module : 5 48 296 5 296 YANG data structure : 0 1032 0 1 1048 Zclient : 2 3144 6288 2 6288 Redistribution instance IDs : 6 2 144 6 144 log thread-local buffer : 2 24608 49232 2 49232 --- qmem logging subsystem --- Type : Current# Size Total Max# MaxBytes log file target : 1 88 88 1 88 log file name : 1 22 24 1 24 syslog target : 1 56 56 1 56 --- qmem bgpd --- Type : Current# Size Total Max# MaxBytes Peer KeepAlive Timer : 22 24 528 22 528 BGP Peer pthread Conditional : 1 48 56 1 56 BGP Peer pthread Mutex : 1 40 40 1 40 Mac Hash Entry : 7 16 168 7 168 Mac Hash Entry Intf String : 14 variable 336 15 360 BGP instance : 25 variable 10760 25 10760 BGP listen socket details : 2 144 304 2 304 BGP peer : 53 20864 1106216 53 1106216 BGP peer hostname : 70 variable 1920 71 1960 Peer group : 4 64 288 4 288 BGP Peer group hostname : 4 9 96 4 96 Peer description : 26 variable 624 26 624 BGP peer af : 26 80 2288 26 2288 BGP update group : 22 104 2288 22 2288 BGP update subgroup : 22 240 5456 22 5456 BGP packet : 22 56 1232 32 1792 BGP attribute : 273809 320 89809800 273825 89815048 BGP aspath : 78026 40 3121184 78068 3122864 BGP aspath seg : 78057 24 1874040 78100 1875296 BGP aspath segment data : 78057 variable 2014600 78100 2015104 BGP aspath str : 78027 variable 5007672 78070 5010080 BGP table : 87 56 4872 87 4872 BGP node : 607055 192 121431832 607169 121454504 BGP route : 558869 136 76066488 558971 76080232 BGP ancillary route info : 1 432 440 1 440 BGP connected : 16 4 384 16 384 BGP static : 7 152 1064 7 1064 BGP adv attr : 0 24 0 12594 302368 BGP adv : 0 64 0 56940 4099760 BGP synchronise : 22 48 1232 22 1232 BGP adj in : 572533 48 34468296 572635 34478776 BGP adj out : 514729 96 53844360 514849 53857560 BGP multipath info : 679 48 38024 679 38024 BGP AS list : 1 40 40 1 40 BGP AS filter : 2 48 112 2 112 BGP AS filter str : 2 variable 48 2 48 community : 532 40 21312 534 21408 community val : 532 variable 18624 534 18672 community str : 520 variable 4264000 521 4272216 extcommunity : 53 40 2136 54 2192 extcommunity val : 53 variable 1352 54 1392 extcommunity str : 53 variable 7720 53 7720 community-list : 35 56 1976 35 1976 community-list name : 35 variable 840 35 840 community-list entry : 35 56 2120 35 2120 community-list config : 32 variable 768 32 768 community-list handler : 1 120 136 1 136 BGP Process queue : 0 32 0 7 280 BGP transit attr : 5 24 120 5 120 BGP transit val : 5 25 200 5 200 BGP nexthop : 237 184 43608 237 43608 BGP regexp : 34 64 2448 34 2448 BGP own address : 16 64 1152 16 1152 BGP Filter Information : 118 variable 2896 120 2944 Large Community : 79424644 40 3177027200 79424645 3177027240 Large Community display string: 149 variable 9128 149 9128 Large Community value : 79424644 variable 1906212544 79424645 1906212568 BGP EVPN MH Information : 1 56 72 1 72 Software Version : 2 16 48 2 48 BGP Martian Addr Intf String : 16 variable 384 16 384 BGP PBR Context : 1 32 40 1 40 BGP interface context : 14 4 336 14 336 BGP EVPN instance information : 1 56 56 1 56 --- qmem rfapi --- Type : Current# Size Total Max# MaxBytes NVE Configuration : 1 2984 2984 1 2984 RFAPI Generic : 1 296 296 1 296 RFAPI Import Table : 1 208 216 1 216
~ 10 minutes later
system top reports:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15022 frr 20 0 11.2g 11.0g 5640 S 0.3 35.4 2:29.07 bgpd
1668 frr 20 0 636332 186880 2040 S 0.0 0.6 59:09.59 zebra
show memory
System allocator statistics: Total heap allocated: > 2GB Holding block headers: 10252 KiB Used small blocks: 0 bytes Used ordinary blocks: > 2GB Free small blocks: 4544 bytes Free ordinary blocks: 5370 KiB Ordinary blocks: 1401 Small blocks: 81 Holding blocks: 3 (see system documentation for 'mallinfo' for meaning) --- qmem libfrr --- Type : Current# Size Total Max# MaxBytes Buffer : 6 24 160 6 160 Buffer data : 1 variable 4120 8 32960 Host config : 8 variable 304 8 304 Completion item : 0 variable 0 1 24 Command Tokens : 12944 72 932672 12966 934272 Command Token Text : 9273 variable 323864 9294 330096 Command Token Help : 9273 variable 223544 9294 224048 Command Argument : 2 variable 48 24 576 Command Argument Name : 2248 variable 53984 2269 54488 Command Match Stack : 0 variable 0 125 3000 Lexer token (temporary) : 0 variable 0 2 64 Access List : 1 56 56 1 56 Access List Str : 1 14 24 1 24 Access Filter : 2 112 240 2 240 RCU thread : 2 128 272 2 272 FRR POSIX Thread : 4 variable 320 4 320 POSIX sync primitives : 4 variable 192 4 192 Graph : 41 8 984 42 1008 Graph Node : 15224 32 611040 15226 611120 Hash : 703 variable 34376 703 34376 Hash Bucket : 355834 32 14239008 360167 14412216 Hash Index : 352 variable 5645048 353 7746280 Interface : 14 272 3920 14 3920 Connected : 25 48 1416 25 1416 Link List : 159 40 6408 163 6576 Link Node : 680 24 16512 807 19608 Temporary memory : 217 variable 11656 229 12168 Bitfield memory : 2 variable 10256 2 10256 Nexthop : 237 152 36200 237 36200 Northbound Node : 258 1192 307536 258 307536 Northbound Configuration : 2 24 48 2 48 Northbound Configuration Entry: 372 1032 384144 372 384144 Prefix List : 10 88 880 10 880 Prefix List Str : 10 variable 240 10 240 Prefix List Entry : 44 136 6048 44 6048 Prefix List Trie Table : 28 4096 114912 28 114912 Prefix : 25 56 1480 25 1480 Privilege information : 3 variable 136 3 136 Ring buffer : 62 variable 3048992 62 3048992 Route map : 22 120 2640 22 2640 Route map name : 171 variable 4104 172 4128 Route map index : 92 152 14464 92 14464 Route map rule : 201 40 8088 201 8088 Route map rule str : 192 variable 4944 192 4944 Route map compiled : 267 variable 6408 267 6408 Route map dependency : 42 24 1008 42 1008 Route map dependency data : 75 16 1800 75 1800 Skip List : 2 56 112 2 112 Skip Node : 2 160 336 2 336 Skiplist Counters : 2 68 144 2 144 Socket union : 44 112 5280 44 5280 Stream : 49 variable 2602472 33800 7476528 Stream FIFO : 62 64 4464 62 4464 Route table : 144 56 8128 144 8128 Route node : 56 120 6880 56 6880 Thread : 79 160 13288 110 18544 Thread master : 12 variable 50352 12 50352 Thread Poll Info : 6 8192 49200 6 49200 Thread stats : 29 96 3032 29 3032 Typed-hash bucket : 53 variable 6304328 53 6304328 Typed-heap array : 1 576 584 1 584 Vector : 30532 24 733952 30537 734072 Vector index : 30532 variable 963376 30537 963880 VRF : 1 216 216 1 216 VRF bit-map : 3 8 72 3 72 VTY : 5 variable 61256 5 61256 VTY server : 2 32 96 2 96 VTY output buffer : 0 1038 0 1 1048 VTY history : 1 variable 24 10 240 Work queue : 3 144 456 3 456 Work queue item : 0 24 0 7 168 Work queue name string : 3 variable 72 3 72 YANG module : 5 48 296 5 296 YANG data structure : 0 1032 0 1 1048 Zclient : 2 3144 6288 2 6288 Redistribution instance IDs : 6 2 144 6 144 log thread-local buffer : 2 24608 49232 2 49232 --- qmem logging subsystem --- Type : Current# Size Total Max# MaxBytes log file target : 1 88 88 1 88 log file name : 1 22 24 1 24 syslog target : 1 56 56 1 56 --- qmem bgpd --- Type : Current# Size Total Max# MaxBytes Peer KeepAlive Timer : 22 24 528 22 528 BGP Peer pthread Conditional : 1 48 56 1 56 BGP Peer pthread Mutex : 1 40 40 1 40 Mac Hash Entry : 7 16 168 7 168 Mac Hash Entry Intf String : 14 variable 336 15 360 BGP instance : 25 variable 10760 25 10760 BGP listen socket details : 2 144 304 2 304 BGP peer : 53 20864 1106216 53 1106216 BGP peer hostname : 70 variable 1920 71 1960 Peer group : 4 64 288 4 288 BGP Peer group hostname : 4 9 96 4 96 Peer description : 26 variable 624 26 624 BGP peer af : 26 80 2288 26 2288 BGP update group : 22 104 2288 22 2288 BGP update subgroup : 22 240 5456 22 5456 BGP packet : 22 56 1232 32 1792 BGP attribute : 273963 320 89860312 274031 89882616 BGP aspath : 78060 40 3122544 78090 3123744 BGP aspath seg : 78091 24 1874856 78121 1875576 BGP aspath segment data : 78091 variable 2015480 78121 2016552 BGP aspath str : 78061 variable 5009240 78091 5012424 BGP table : 87 56 4872 87 4872 BGP node : 607105 192 121441880 607185 121457896 BGP route : 558907 136 76071688 558986 76082432 BGP ancillary route info : 1 432 440 1 440 BGP connected : 16 4 384 16 384 BGP static : 7 152 1064 7 1064 BGP adv attr : 0 24 0 12594 302368 BGP adv : 0 64 0 56940 4099760 BGP synchronise : 22 48 1232 22 1232 BGP adj in : 572571 48 34469784 572650 34478776 BGP adj out : 514723 96 53843848 514921 53864328 BGP multipath info : 679 48 38024 679 38024 BGP AS list : 1 40 40 1 40 BGP AS filter : 2 48 112 2 112 BGP AS filter str : 2 variable 48 2 48 community : 532 40 21312 534 21408 community val : 532 variable 18624 534 18672 community str : 520 variable 4264000 521 4272216 extcommunity : 53 40 2136 54 2192 extcommunity val : 53 variable 1352 54 1392 extcommunity str : 53 variable 7720 53 7720 community-list : 35 56 1976 35 1976 community-list name : 35 variable 840 35 840 community-list entry : 35 56 2120 35 2120 community-list config : 32 variable 768 32 768 community-list handler : 1 120 136 1 136 BGP Process queue : 0 32 0 7 280 BGP transit attr : 5 24 120 5 120 BGP transit val : 5 25 200 5 200 BGP nexthop : 237 184 43608 237 43608 BGP regexp : 34 64 2448 34 2448 BGP own address : 16 64 1152 16 1152 BGP Filter Information : 118 variable 2896 120 2944 Large Community : 143792317 40 5751753192 143792318 5751753232 Large Community display string: 149 variable 9128 149 9128 Large Community value : 143792317 variable 3451044872 143792318 3451044896 BGP EVPN MH Information : 1 56 72 1 72 Software Version : 2 16 48 2 48 BGP Martian Addr Intf String : 16 variable 384 16 384 BGP PBR Context : 1 32 40 1 40 BGP interface context : 14 4 336 14 336 BGP EVPN instance information : 1 56 56 1 56 --- qmem rfapi --- Type : Current# Size Total Max# MaxBytes NVE Configuration : 1 2984 2984 1 2984 RFAPI Generic : 1 296 296 1 296 RFAPI Import Table : 1 208 216 1 216
IPv4 Unicast Summary (VRF default): BGP router identifier XXX.XXX.XXX.XXX, local AS number XXXXXX vrf-id 0 BGP table version 128817 RIB entries 235456, using 43 MiB of memory Peers 13, using 265 KiB of memory Peer groups 4, using 256 bytes of memory
IPv6 Unicast Summary (VRF default): BGP router identifier XXX.XXX.XXX.XXX, local AS number XXXXXX vrf-id 0 BGP table version 254868 RIB entries 371565, using 68 MiB of memory Peers 13, using 265 KiB of memory Peer groups 4, using 256 bytes of memory
The OS is linux debian bookworm (all is upgrade to current stable). FRR is installed from package ( repository: https://deb.frrouting.org/frr)
The bgpd system memory usage keep growing until it not consume all the available memory (more than 25G after a hour or two) and kernel's oom-killer just kill the process. From what I see largest memory usage is from large community structures, in the config bgpd do set and match on every peer and peer-group (in the route maps).
I tried to debug with valgrind but it is hard, because that router is in production environment.
can we see the routemaps surrounding large community usage?
Sure, I send you the complete bgpd.conf on your email, because it could be related and to the other config options I have no idea yet. Just spot bgpd crashing periodically couple of hours ago and so far it is all the debug info I have...
More info:
- The same memory leak exists in the stocked debian frr version: frr/stable,stable-security,now 8.4.4-1.1~deb12u1 amd64. With very similar config (but not same) on version 8.1 seems to work normal.
- Removing all 'additive' sets (keep sets but not on additive basis - clearing tags and set new lc/c tags) on large-community and on community in the route maps, drastically decreasing memory usage, also drastically decrease the rate of increasing consumed memory over the time. But I still observing increase ram usage (before bgpd was killed by the oom-killer for 1.5-2 hours, now will be approximately after 60-80 hours ). More puzzling is that 'additive' sets on normal community tags affects memory usage on large-community structures (by the 'show memory' command).
- I'm almost sure the problem do not come from match, it come from set. Run same match logic with same route maps on 9.0.1 with more peers and more prefixes, and there are no problems, the memory usage is 400k and not increasing to 30G :)
More debug info:
The memory leak come between versions 8.3.1 and 8.4.0 !
Here are some tests Same config file, same peers same ipv4/ipv6 prefixes (approximately +/- couple from half a million prefixes ).
# show memory
version 8.3.0
community : 500 40 20000 502 20080 community val : 500 variable 17808 502 17968 community str : 480 8192 3936000 481 3944200 Large Community : 1892 40 75696 1893 757368.3.1
community : 500 40 20016 501 20056 community val : 500 variable 17808 501 17944 community str : 480 8192 3936000 480 3936000 community-list : 35 56 1960 35 1960 community-list name : 35 variable 840 35 840 community-list entry : 35 56 2072 35 2072 community-list config : 32 variable 768 32 768 community-list handler : 1 120 136 1 136 Large Community : 1891 40 75640 1892 756808.4.0
community : 500 40 20000 501 20056 community val : 500 variable 17776 501 17912 community str : 482 8192 3952400 482 3952400 community-list : 35 56 1960 35 1960 community-list name : 35 variable 840 35 840 community-list entry : 35 56 2152 35 2152 community-list config : 32 variable 768 32 768 community-list handler : 1 120 136 1 136 Large Community : 116733 40 4680856 116734 4680896All statistics I am showing are collected after all neighbors was connected and all prefixes accepted/announced !
I dont know yet which part of the code cause memory leak when just set large-community, but after half a day compile and change version I can confirm the memory leak exists in all frr versions starting from 8.4.0. In first look it seems no changes in bgp_lcommunity.c and bgp_lcommunity.h. If anyone have some idea or memory what could cause this will greatly appreciate to share (will save me a lot of time to debug and reinvent the hot water :).
After a lot of lurking and testing found the commit which makes the problem:
( https://github.com/FRRouting/frr/commit/8cb56fb2738605c3babe0fc191339fbb52da8655 ).
In my config frr sets large community on the route map associated to default-originate . Maybe that cause memory leak and constantly very fast increasing LC structures , maybe that patch mess and another route maps I dont know yet (not very easy to change config logic in production environment just for testing).
Meanwhile maybe someone more familiar (I will have to learn and get know all the code around first) can take a look over that commit and may have idea what is wrong ?
UPDATE: Finally I managed to simulate the leak in test environment. Here is a simple config:
! ! Zebra configuration saved from vty ! 2024/02/04 10:25:24 ! ! loaded from 8.2.2 frr version 9.0.1 frr defaults traditional ! hostname AS1 password tt enable password tt log file /var/log/frr/bgpd.log ! ! ! router bgp 65501 bgp router-id 10.185.1.31 no bgp ebgp-requires-policy no bgp default ipv4-unicast no bgp graceful-restart notification bgp graceful-restart-disable no bgp network import-check neighbor 10.185.1.33 remote-as 65503 neighbor 10.185.1.33 description R1 neighbor 10.185.1.33 solo ! address-family ipv4 unicast network 10.1.2.0/24 network 10.1.3.0/24 network 10.1.4.0/24 neighbor 10.185.1.33 activate neighbor 10.185.1.33 default-originate route-map DEFR neighbor 10.185.1.33 soft-reconfiguration inbound neighbor 10.185.1.33 route-map T4-IN in neighbor 10.185.1.33 route-map T4-OUT out exit-address-family ! address-family ipv6 unicast exit-address-family ! exit ! access-list telnet-filter seq 5 permit 127.0.0.1/32 access-list telnet-filter seq 10 deny any ! bgp large-community-list standard defroute seq 1 permit 65503:300:300 ! ! route-map T4-IN permit 10 set large-community 65503:500:330 additive exit ! route-map T4-OUT permit 5 match large-community defroute set large-community 65501:300:330 exit ! route-map T4-OUT permit 10 set large-community 65501:400:330 exit ! route-map DEFR permit 10 set large-community 65501:300:300 exit ! ! line vty access-class telnet-filter exit !
After established connection with 10.185.1.33:
AS1# show memory ... Large Community : 28 40 1136 28 1136 Large Community display string: 5 variable 904 5 904 Large Community value : 28 variable 912 28 912 ... AS1# clear bgp ipv4 10.185.1.33 wait to connection with the peer to be establish AS1# show memory ... Large Community : 35 40 1416 35 1416 Large Community display string: 5 variable 904 5 904 Large Community value : 35 variable 1080 35 1264 ...
As you can see structures for LC are increased, and something is not freed. This leak apply to all frr versions after commit 8cb56fb2738605c3babe0fc191339fbb52da8655 . With snapshot frr-455b860f9db5e5be186a7d1606283b3a02989c64.zip (right before 8cb56 commit) I dont see increasing of the memory because Large Communities.
More debug:
The memory leak comes when use clause "set" in default-originate route map policy and is not present only when setting large-community. Setting community string in the default-originate RM also trigger huge memory leak (community structures start increasing). The worst is that once you put "set" clause in the route-map associated to default-originate the leak is triggered. After that nor delete the route-map itself nor removing route-map from the neighbor section stop increasing the memory in the bgpd process, the only you can do is to restart all frr daemons. Also I believe the things get worse if default route is learned and from other neighbor (even it is filtered on the neighbor input with prefix-list). I experienced and very strange behaviours when changing set clauses in the RM associated with default-originate. All can be simulated with very simple config (such I posted above) in a controlled environment.
I manually remove commit 8cb56fb2738605c3babe0fc191339fbb52da8655 from current stable 9.1 and for more than 3 hours dont observe massive memory increase of the bgpd process. Of course changes on the fly in default-originate RM do not work as expected, but frr is usable and stable (for now). Maybe it is good to revert that commit and update all bin packages from version 8.3.2 up to the current stable. I believe many routers in the internet which uses frr announce default-originate and not few of them uses route-map with "set" clauses. I will keep looking and knowing the code, but it is very complicated and unclear. I dont think there is a simple quick fix, maybe there is a need of total rewrite of the default-originate RM algorithm. Or someone really who knows the work logic in details to take a look.
I'll test this and let you know.
Thank you very much Donatas ! If you have idea for a fix or make a patch, I can test it against couple versions in different environments (even in production where bgpd is very heavy loaded and complicated configs).
To saiarcot896: There are a lot of fixes for different leaks from 8.5.1 to 9.1. In all my tests, with versions from 8.4 to 9.1 I observe different memory usage and behaviour and 9.1 seems to consume least memory. Can you compile frr from source in your test environment ? If you can repeat your tests with frr 9.1 and if you observe same leak try with http://i.bglans.net/frr-stable-9.1-without_8cb56fb.tar.gz (it is standart 9.1 downloaded a week ago, and removed commit 8cb56fb2738605c3babe0fc191339fbb52da8655, that I test for 6 days and dont observe leaks in production - couple of milion ipv4+ipv6 internet routes, couple ipv4+ipv6 default routes received/announced/generated, with 15-20 peers), that will help us a lot. Also is there a default route (0.0.0.0/0 or ::/0) among these 6400 routes you are testing with ?
I think maybe have an idea what cause the leak.... In bgpd/bgp_updgrp_adv.c in the function: subgroup_default_originate
new_ret = route_map_apply_ext(
peer->default_rmap[afi][safi].map,
bgp_dest_get_prefix(dest), pi, &tmp_pi,
&new_pref);
This call route_map_apply_ext from lib/routemap.c ...
/*
* set cmds return RMAP_OKAY or
* RMAP_ERROR. We do not care if
* set succeeded or not. So, ignore
* return code.
*/
(void)(*set->cmd->func_apply)(
set->value, prefix, set_object);
which call route_set_lcommunity -> lcommunity_dup -> lcommunity_new -> calloc new memory because in subgroup_default_originate we treat attr like blank new structure which need to be first initialized.
Seems subgroup_default_originate functions is also called when there are withdraw/announce of the default route learned and from a peer. The check for applying of the default-originate route map ( if (peer->default_rmap[afi][safi].name)... ) is done (only by "if there is a name of the route-map" ?!? not by any real change or RM name change ?) every time in the if scope and not all pointers are freed ( call bgp_attr_flush() ) in these tmp structures, also there is mess how they are used. I dont understand what is the idea behind the "pref" pointer some sort of preference or ?
At all It is posible to be done some properly free, but this is ugly and truly I think its not right. The logic is wrong, and maybe some functionality will not work this way. By me and from user's perspective the logic should be this:
There are 2 ways to send default route to a peer:
- Learn it from other peer and re-announce. There default route should be treated like every other prefix.
- Own generate and announce (neighbor XXX default-originate). default route should be treated as a high weight and highest priority (like a local network), do what default-originate route-map do. Put it in the peer flow, and after that threat it like any other prefix (from this stage to the end for 1 and 2 apply same logic)
Finally call subgroup_default_originate() which set attrs and need flags (from output route map if there are such), related with the default route which to be announce/withdraw to the peer.
Please correct me if I am wrong somewhere in the logic or missing something
@IvayloJ did you check this patch https://github.com/FRRouting/frr/pull/15466?
Yes, already integrated Donald's patch in my testing 9.1 but it will not resolve this leak. I know what cause it, and testing since yesterday ugly (most likely and not very proper) but fast fix. The problem is in subgroup_default_originate(). When call route_map_apply_ext() it allocate new memory for data inside the stucture tmp_pi.attr (tmp_pi.attr = tmp_attr), which we lost because never free. Here is my test change in bgp_updgrp_adv.c -> subgroup_default_originate()
new_ret = route_map_apply_ext(
peer->default_rmap[afi][safi].map,
bgp_dest_get_prefix(dest), pi, &tmp_pi,
&new_pref);
if (new_ret == RMAP_PERMITMATCH) {
if (new_pref
So far can confirm few things:
- It will stop this leak (https://github.com/FRRouting/frr/issues/14828).
- Tiny memory leak will still be present (when changing sets in default-originate route map, the memory containing old set values will be lost and not freed).
- Another tiny memory leak still there if delete or change the name of route-map associated with default-originate, some memory will not be released.
- Self generated default routes (ipv4/ipv6) are out of the output route-map flow.
- Apply/change of default-originate route-map is working.
And now the long part :)
During my tests and debug I put simple zlog_err("move here"); on top of the subgroup_default_originate(). Seems it is very "popular" function, it executes on every announce/withdraw/update message we receive and announce no matter if it is default route (self generated and received) or some other prefix. Which explain why I observed huge memory leak with many and noisy peers. That also explain https://github.com/FRRouting/frr/issues/15459 and seems Donald's commit will resolve it.
From what see in the comments I think subgroup_default_originate() have been written initially to handle only default-originate very basic usage. With the time rest of the code logic was changed , subgroup_default_originate() do not follow these changes and its usage is wrong now. It can be completely scraped and its function to be handled by default flow functions and default-originate + default-originate route-map functions to be handled in early stages in the code. Or it can stay but default-originate + the route map to it, to be removed from there, and handled in separate function called from updgrp_policy_update_walkcb() && update_group_default_originate_route_map_walkcb(). Also when we set attr for the self generated 0.0.0.0/0 and ::/0 we should use bgp_attr_default_set() the associated route map to default-originate (if there are such) should only change this default attrs. Also this way will gain little speed and optimization, because the time spend in subgroup_default_originate() will be less.
In the current situation there are maybe and other potential problems:
for (pi = bgp_dest_get_bgp_path_info(dest); pi;
pi = pi->next) {
struct attr tmp_attr = attr;
tmp_pi.attr = &tmp_attr;
So far as my knowledge (very basic) are over the compilers it depending on versions, bugs compile optimization options and e.t.c. on each iteration in the loop you may or may not get new memory for the "tmp_attr" , this memory may be freed on the scope exit or on the function exit (again differ from compiler to compiler). Which means potential stack crash (depending how much stack memory the thread/process can use) Also it is not very clear if the all memory of tmp_attr will be zeroed or will be filled with some random data in it (the compiler may or may not generate assembler code which will copy the all attr in tmp_attr, may copy only populated data in the attr). So I think best will be to get out of the loop scope declaration, zeroing ( memset(&tmp_attr,0,sizeof(struct attr));), assign the pointer tmp_pi.attr = &tmp_attr , and in the loop do memcpy(&tmp_attr, &attr, sizeof(struct attr)); on each iteration. Yes longer for write but much safer for use :). Have to check and "attr" family functions in attr.c , maybe to use them is best approach for work with struct attr. Also maybe is good to have signalling when there are change in the default-originate route map, because we dont have such.
@ton31337 After all my tests and debugs , maybe have a solution for this issue. It is not perfect, but at all is working and is better than to have this huge leak. When you have a time can you create a patch and backport it all versions from 8.4 above ? Now (9.1 current stable) in bgpd/bgp_updgrp_adv.c (from line 889 to line 934)
if (peer->default_rmap[afi][safi].name) {
struct bgp_path_info tmp_pi = {0};
tmp_pi.peer = bgp->peer_self;
SET_FLAG(bgp->peer_self->rmap_type, PEER_RMAP_TYPE_DEFAULT);
/* Iterate over the RIB to see if we can announce
* the default route. We announce the default
* route only if route-map has a match.
*/
for (dest = bgp_table_top(bgp->rib[afi][safi_rib]); dest;
dest = bgp_route_next(dest)) {
if (!bgp_dest_has_bgp_path_info_data(dest))
continue;
for (pi = bgp_dest_get_bgp_path_info(dest); pi;
pi = pi->next) {
struct attr tmp_attr = attr;
tmp_pi.attr = &tmp_attr;
new_ret = route_map_apply_ext(
peer->default_rmap[afi][safi].map,
bgp_dest_get_prefix(dest), pi, &tmp_pi,
&new_pref);
if (new_ret == RMAP_PERMITMATCH) {
if (new_pref peer_self->rmap_type = 0;
Change with:
/* During my tests and debug, spot a situation where we have .name but
* not and .map (almost every time on the start of the process).
* Maybe init is in other thread which starts slower. Don know how to
* handle such situation, but in all cases it is safer to enter.
*/
if (peer->default_rmap[afi][safi].name && peer->default_rmap[afi][safi].map) {
struct bgp_path_info tmp_pi;
struct attr tmp_attr;
struct bgp_dest *pm_dest = NULL;
struct bgp_path_info *pm_pi = NULL;
memset(&tmp_pi, 0, sizeof(struct bgp_path_info));
memset(&tmp_attr, 0, sizeof(struct attr));
tmp_pi.peer = bgp->peer_self;
tmp_pi.attr = &tmp_attr;
SET_FLAG(bgp->peer_self->rmap_type, PEER_RMAP_TYPE_DEFAULT);
/* Let's try to get lowest rule number in the route-map
* Put it in IFs statements to protect from illegal reads.
*/
int lrm_rule = 0;
if( peer->default_rmap[afi][safi].map->head )
if( peer->default_rmap[afi][safi].map->head->pref )
lrm_rule = peer->default_rmap[afi][safi].map->head->pref;
/* This semi-conditional logic is sh.. - only on prefix list.
* Much more logical/flexible/usable from user point
* is to handle default route announce in the peer[s] output
* route-map, in the default-originate route-map only to
* permit set clauses for self generated default route.
* But what is that is. Lets iterate over the RIB (can be
* millions times) to check if we will have permit match
* to announce default route.
*/
for (dest = bgp_table_top(bgp->rib[afi][safi_rib]); dest;
dest = bgp_route_next(dest)) {
if (!bgp_dest_has_bgp_path_info_data(dest))
continue;
for (pi = bgp_dest_get_bgp_path_info(dest); pi;
pi = pi->next) {
/* route_map_apply_ext() allocates new memory
* in tmp_pi.attr ! Better to write new func
* with the same logic (much faster as well)
* to do only check, and when we know what to
* do then to call route_map_apply_ext()
*/
new_ret = route_map_apply_ext(
peer->default_rmap[afi][safi].map,
bgp_dest_get_prefix(dest), pi, &tmp_pi,
&new_pref);
if (new_ret == RMAP_PERMITMATCH) {
/* pref is the rule number in the RM
* On which we have match
*/
if (new_pref default_rmap[afi][safi].map,
bgp_dest_get_prefix(pm_dest), pm_pi, &tmp_pi,
&new_pref);
bgp_attr_flush(new_attr);
new_attr = bgp_attr_intern(&tmp_attr);
bgp_attr_flush(&tmp_attr);
subgroup_announce_reset_nhop(
(peer_cap_enhe(peer, afi, safi)
? AF_INET6
: AF_INET),
new_attr);
} else
ret = RMAP_DENYMATCH;
bgp->peer_self->rmap_type = 0;
This patch also will have significant speed impact in cases where there are no match clause in the default-originate RM. If you have idea how to check what is the condition of the first rule (permit/deny) will be great. So if for example we have in the config.
route-map XXX deny 1 exit
where obviously is pointless to spin these loops, ( in case of full internet bgp tables can take ~ 2 millions * RM-rules-number iterations).