bgpd: do not crash when labels are empty

Open crosser opened this issue 7 months ago • 6 comments

bgp_evpn_path_info_get_l3vni() tries to find the l3vni associated with the path. However, under some circumstances, bgp_evpn_path_info_labels_get_l3vni() may return NULL, and passing that NULL to label2vni() causes abort(). bgpd crashes, which may lead to several seconds of lost connectivity to the host until the daemon is restarted and the BGP sessions are reestablished.
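The kind of guard described above can be sketched in isolation. This is a simplified model, not the actual FRR code or the actual patch: the `demo_*` types and functions are hypothetical stand-ins, the 3-byte label layout is an assumption for illustration, and returning 0 to mean "no l3vni" is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are NOT FRR's real types. */
typedef uint32_t demo_vni_t;

struct demo_labels {
	uint8_t l3_label[3]; /* assumed layout: 24-bit VNI in three bytes */
};

struct demo_path {
	const struct demo_labels *labels; /* may be NULL */
};

/* Models a decoder that aborts when handed a NULL label pointer. */
static demo_vni_t demo_label2vni(const uint8_t *label)
{
	assert(label != NULL);
	return ((demo_vni_t)label[0] << 16) | ((demo_vni_t)label[1] << 8) |
	       (demo_vni_t)label[2];
}

/*
 * Guarded lookup: if the path carries no labels, report "no l3vni"
 * (0 in this sketch) instead of calling the decoder and aborting.
 */
static demo_vni_t demo_path_get_l3vni(const struct demo_path *path)
{
	if (path == NULL || path->labels == NULL)
		return 0;

	return demo_label2vni(path->labels->l3_label);
}
```

With a guard like this, a path without labels is treated as having no l3vni rather than taking the daemon down. It does not explain why the labels are missing in the first place.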

This behaviour was observed in real life. Here is a partial stack trace:

/usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(_zlog_assert_failed+0xe9) [0x7f60dbff5299]
/usr/lib/frr/bgpd(bgp_evpn_mpath_has_dvni+0x90) [0x5649d89a9140]
/usr/lib/frr/bgpd(bgp_evpn_path_es_use_nhg+0x10b) [0x5649d89b12fb]
/usr/lib/frr/bgpd(bgp_zebra_announce+0x234) [0x5649d8a68214]

Bug-URL: #18678

crosser avatar Apr 17 '25 13:04 crosser

Please fix styling.

ton31337 avatar Apr 17 '25 15:04 ton31337

ci:rerun

RodrigoMNardi avatar Apr 17 '25 19:04 RodrigoMNardi

How do we get to a situation where an l3vni doesn't have a label? That seems to be the problem, right? I'd like to fix that...

donaldsharp avatar Apr 17 '25 22:04 donaldsharp

> How do we get to a situation where an l3vni doesn't have a label? That seems to be the problem, right? I'd like to fix that...

Unfortunately, I do not have a reproducer. It started to happen when our operations team tried to bring up new hardware that has a slightly different connectivity arrangement. I needed to "do something" to unblock the rollout, and this spot seems worth fixing in any case, doesn't it?

I cannot comment on your assertion, as I have no clue what MPLS labels have to do with BGP EVPN in the first place :joy:

That said, all that I see here in our environment points to some fault in dynamic reconfiguration:

  • The problem happens during "reload".
  • The problem no longer shows up after the affected daemon is restarted.
  • I have another case where zebra segfaults during reload (and works fine after it is re-launched by watchfrr). I may have a reproducer for this one after I am back from vacation (in the middle of May).
  • Dynamic reconfiguration of l3 bgp-evpn is completely broken in a later version, 10.1 (that's why we cannot upgrade beyond 10.0). It is conceivable that a latent problem already present in earlier versions is what shows up in 10.1 (though this is only speculation).

crosser avatar Apr 18 '25 09:04 crosser

So the part that interests me is that by putting this test in, instead of figuring out why the problem occurs, we are left with a situation where FRR may be doing the wrong thing elsewhere in the network and we just have not noticed it. This needs to be run to ground from my perspective. This is akin to possibly putting a bandaid on a cut to an artery. The problem is just going to pop up elsewhere in some negative fashion.

donaldsharp avatar Apr 21 '25 14:04 donaldsharp

In case it helps: I noticed that the crash is no longer triggered from 10.1 onward. I tested 10.1, 10.1.3, 10.2.2 and 10.3 with the sequence that crashes 10.0.1 or 10.0.3 on almost every run, but could not trigger the crash with any of the later versions. I guess the bug was somehow fixed in a later version.

xjtuwjp avatar Apr 21 '25 18:04 xjtuwjp

I don't know that we should push code into a current version to fix something in an older version--especially since I don't think these older versions are still supported (?).

riw777 avatar Aug 05 '25 14:08 riw777