frr icon indicating copy to clipboard operation
frr copied to clipboard

zebra: uninstall nhg when no best re point to it

Open chiragshah6 opened this issue 7 months ago • 6 comments

Currently when upper layer protocols like ospf and static route sends the same route with different nexthops sequentially, zebra creates two different route_entry (re) and two distinct nhgs. First the static route and corresponding selected and installed, subsequently, the ospf route is installed Once zebra rib selects the ospf as best the static as non best, then the static route is not uninstalled from dplane. When networking restart happens at that point only the best NHG is installed. If user does query of nexthop-group nhe state compare prior to and after networking restart, the behavior is not symmetric. To make consistent, when zebra rib selects the best nhg to installed, adding a bit of logic to uninstall old route_entry's nhg if it is not selected as best by any of the route_entries then uninstall the nhg from kernel. This requires to keep best installed re's count in nhg struct.

Initial route points to static, route and NHG 34 is installed:

DUT# show ip route nexthop-group

   S>* 20.8.8.0/30 [115/0] (34) via 20.7.7.2, swp31s0, weight 1, 00:26:31
OSFPv2 come up installs the route, route and new NHG 48 is installed:

DUT# show ip route nexthop-group
    B>* 20.6.6.0/30 [20/0] (37) via 20.3.3.2, swp1s2, weight 1, 00:29:24
    S   20.6.6.0/30 [75/0] (34) via 20.7.7.2, swp31s0, weight 1, 00:29:30
    O>* 20.8.8.0/30 [110/8] (48) via 20.1.1.1, swp1s0, weight 1, 00:01:48
    S   20.8.8.0/30 [115/0] (34) via 20.7.7.2, swp31s0, weight 1, 00:29:30

Old NHG 34 is not removed kernel:

   DUT# show nexthop-group rib 34
   ID: 34 (zebra)
        RefCnt: 2
        Uptime: 00:33:27
        VRF: default
        Valid, Installed    <<<<<  installed
        Interface Index: 37
              via 20.7.7.2, swp31s0 (vrf default), weight

After fix: NHG 34 is retained in zebra from unistalled from dplane DUT# show nexthop-group rib 34

    ID: 34 (zebra)
         RefCnt: 2
         Uptime: 00:27:06
         VRF: default
         Valid
         Interface Index: 37
               via 20.7.7.2, swp31s0 (vrf default), weight 1

Signed-off-by: Chirag Shah [email protected]

chiragshah6 avatar May 27 '25 13:05 chiragshah6

hmm - I'd like to hear more about this: "Once zebra rib selects the ospf as best the static as non best, then the static route is not uninstalled from dplane."

that seems ... unexpected, I'd expect that the route "update" code was uninstalling the non-best route

mjstapp avatar May 27 '25 13:05 mjstapp

hmm - I'd like to hear more about this: "Once zebra rib selects the ospf as best the static as non best, then the static route is not uninstalled from dplane."

that seems ... unexpected, I'd expect that the route "update" code was uninstalling the non-best route

@mjstapp When an existing route's re (lets say static in the example captured in commit message) becomes non-best, its associated re becomes non best but nhe refcount is greater than 0 so it won't explicitly uninstall. zebra_nhg_uninstall_kernel -> dplane_nexthop_delete (which uninstalls) . This is the code path followed but given refcount is non zero, the dplane_nexthop_delete won't be called. I have captured detail explanation in commit/PR with example.

chiragshah6 avatar May 27 '25 16:05 chiragshah6

ah, ok - it's not the "static route" that isn't uninstalled, it's the nhg(s) used by the static route.

@mjstapp When an existing route's re (lets say static in the example captured in commit message) becomes non-best, its associated re becomes non best but nhe refcount is greater than 0 so it won't explicitly uninstall. zebra_nhg_uninstall_kernel -> dplane_nexthop_delete (which uninstalls) . This is the code path followed but given refcount is non zero, the dplane_nexthop_delete won't be called. I have captured detail explanation in commit/PR with example.

mjstapp avatar May 27 '25 16:05 mjstapp

ah, ok - it's not the "static route" that isn't uninstalled, it's the nhg(s) used by the static route.

@mjstapp When an existing route's re (lets say static in the example captured in commit message) becomes non-best, its associated re becomes non best but nhe refcount is greater than 0 so it won't explicitly uninstall. zebra_nhg_uninstall_kernel -> dplane_nexthop_delete (which uninstalls) . This is the code path followed but given refcount is non zero, the dplane_nexthop_delete won't be called. I have captured detail explanation in commit/PR with example.

That's Correct! Any of the upper layer protocol's re (route_entry) pointing to a particular NHG. If all the re become non-selected then uninstall NHG from kernel but retain in zebra as NHG is being reference by non selected res.

chiragshah6 avatar May 27 '25 22:05 chiragshah6

@mjstapp / @riw777 kindly please help review and let me know if further query. I have addressed all review comments and concerns.

chiragshah6 avatar Jun 09 '25 12:06 chiragshah6

I'm not convinced that this is something we should do. Adding yet-another refcount in this area just seems awkward. Is there some functional issue that the existing behavior leads to that needs to be fixed? Is there just some cosmetic concern?

@mjstapp there isn't any functional issue, I have tried to explained the rational using example in the commit message. There is a NHG left in dplane/kernel which is not used by any route. That is the problem we are trying to solve. The existing code does not handle it. Though one can say its a corner case but it is a bug why NHG should remained installed in kernel when Zebra decided to switch its best re to another NHG. Please take a look at the scenario outlined should be distinctly identify the issue. What other way you are thinking of solving this corner case? Kindly suggest.

chiragshah6 avatar Jun 10 '25 03:06 chiragshah6

Not needed.

riw777 avatar Jun 17 '25 15:06 riw777