
BGP nexthop group optimisations

Open · pguibert6WIND opened this issue 1 year ago · 36 comments

This patch set proposes changes to BGP. Among the goals of these changes:

  • split the BGP information between prefix and nexthop. This change applies to single-path routes, but also to multiple-path routes (with the addpath functionality).
  • reduce the number of messages exchanged between BGP and ZEBRA during failover scenarios

Link: https://github.com/pguibert6WIND/SONiC/blob/proposal_bgp/doc/pic/bgp_pic_arch_doc.md#9-frrouting-ipc-messaging-from-bgp-to-zebra
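As a rough illustration of the intended split (a sketch only; the structure and field names below are hypothetical, not the actual code from this patch set), routes would reference a shared nexthop-group ID instead of each carrying their own nexthops, so a failover only has to update the group:

#include <stdint.h>

#define MULTIPATH_NUM 64 /* placeholder value */

/* Shared nexthop object, installed once in zebra as an NHG. */
struct bgp_nhg {
        uint32_t id;          /* NHG identifier allocated by BGP */
        uint16_t nexthop_num; /* number of paths (ECMP, addpath) */
        /* ... resolved nexthops ... */
};

/* Each prefix only references the shared group by its ID. */
struct bgp_route_entry {
        /* ... prefix ... */
        uint32_t nhg_id; /* reference to a struct bgp_nhg */
};

/*
 * On failover, BGP updates the single shared group instead of
 * re-announcing every prefix to zebra: one ZAPI message rather than
 * one message per route.
 */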

pguibert6WIND avatar Mar 05 '24 15:03 pguibert6WIND

ci:rerun

pguibert6WIND avatar Mar 21 '24 12:03 pguibert6WIND

ci:rerun

pguibert6WIND avatar Mar 21 '24 13:03 pguibert6WIND

CI:rerun Rerun after fixing git access on CI infra

mwinter-osr avatar Mar 21 '24 14:03 mwinter-osr

ci:rerun

pguibert6WIND avatar Mar 22 '24 20:03 pguibert6WIND

ci:rerun

pguibert6WIND avatar Mar 23 '24 12:03 pguibert6WIND

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions[bot] avatar Mar 26 '24 14:03 github-actions[bot]

⚠️ The sha of the head commit of this PR conflicts with #15636. Mergify cannot evaluate rules on this PR. ⚠️

mergify[bot] avatar Mar 29 '24 16:03 mergify[bot]

ci:rerun

pguibert6WIND avatar Mar 31 '24 06:03 pguibert6WIND

ci:rerun

pguibert6WIND avatar Mar 31 '24 20:03 pguibert6WIND

ci:rerun git.netdef was unreachable

pguibert6WIND avatar Apr 01 '24 07:04 pguibert6WIND

ci:rerun

pguibert6WIND avatar Apr 01 '24 12:04 pguibert6WIND

This does not seem to be rebased since https://github.com/FRRouting/frr/pull/15636/commits/e20faa9fe0dee04d46d67cfabf195c7eb071dcbf was merged; can we rebase to avoid unnecessary redundant review?

ton31337 avatar Apr 03 '24 06:04 ton31337

I started reviewing, but I see this PR is not yet updated and/or finalized? E.g. why is 33daadb fixing b98caac within the same pull request? Can't we have a clean history from the beginning, since this is the same PR?

There are more places like this.

Not so much anymore. A couple of fixes have been squashed into the "single path" commit.

pguibert6WIND avatar Apr 03 '24 12:04 pguibert6WIND

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions[bot] avatar Apr 04 '24 18:04 github-actions[bot]

Still some ASAN issues, mainly related to the way the BGP NHG objects are stored. When refreshing the list of BGP NHGs (based on an event), I detach some nexthops (that are not present anymore), and I also have to do some garbage collection on the remaining groups that depend on those nexthops.

This change breaks the NHG hash list, and some leaks are visible... Previously I was using two distinct hash lists, and that was working well. I tried to address it with a single hash list, but it complicates the code.

So I will come back to the double hash list.
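For illustration, a minimal sketch of the double hash-list idea (hypothetical names; the real code would rely on FRR's lib/hash): one table keyed by the nexthop contents to detect that an equivalent group already exists, and a second keyed by the allocated NHG ID so refresh events can find a group without walking the other table:

struct hash; /* FRR's generic hash table (lib/hash.h) */
struct bgp_nhg_cache;

struct bgp_nhg_tables {
        /* keyed by nexthop contents: "does an equivalent group exist?" */
        struct hash *nhg_by_nexthops;
        /* keyed by allocated NHG ID: "which group does this event target?" */
        struct hash *nhg_by_id;
};

/*
 * On a refresh event:
 *  1. look the group up by ID (no walk over the whole set),
 *  2. detach stale nexthops and re-insert the entry in nhg_by_nexthops,
 *     since its key material changed; nhg_by_id stays untouched.
 */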

pguibert6WIND avatar Apr 09 '24 12:04 pguibert6WIND

About the bgp l3vpn to bgp vrf crash: this is an OOM. Zebra runs out of memory while allocating a stream buffer to redistribute a route:

#9  0xb7db5e9d in memory_oom (size=154784, name=0xb7e5dbf3 "Stream")
    at ../lib/log.c:310
#10 0xb7db9dac in mt_checkalloc (size=154784, ptr=0x0, 
    mt=0xb7eef460 <MTYPE_STREAM>) at ../lib/memory.c:90
#11 qmalloc (mt=0xb7eef460 <MTYPE_STREAM>, size=154784) at ../lib/memory.c:100
#12 0xb7df6aca in stream_new (size=154768) at ../lib/stream.c:95
#13 0x0047412c in zsend_redistribute_route (cmd=31, client=0xb7f910, 
    rn=0x1e0a220, re=0x1e20b80, is_table_direct=false)
    at ../zebra/zapi_msg.c:614
#14 0x0045a927 in redistribute_update (rn=0x1e0a220, re=0x1e20b80, prev_re=0x0)
    at ../zebra/redistribute.c:254
#15 0x004b387e in rib_process_result (ctx=0x6621f980)
    at ../zebra/zebra_rib.c:2220
#16 rib_process_dplane_results (thread=0xbfe13df4) at ../zebra/zebra_rib.c:5003
#17 0xb7e04294 in event_call (thread=0xbfe13df4) at ../lib/event.c:2011
#18 0xb7dacba1 in frr_run (master=0x7928f0) at ../lib/libfrr.c:1217
#19 0x0044483e in main (argc=<optimized out>, argv=<optimized out>)
    at ../zebra/main.c:519

pguibert6WIND avatar Apr 10 '24 16:04 pguibert6WIND

ci:rerun

pguibert6WIND avatar Apr 18 '24 19:04 pguibert6WIND

So first, thanks for providing a detailed description of what you're trying to do - that doesn't happen often enough, really, and it's very helpful. At one level, it'd be fine with me if bgp wants to manage its own nhgs; that'd be ok if that kind of code change could be done effectively and correctly.

That said, I do not want some of what I've read there to happen. BGP does not hold the RIB; zebra does, and so zebra must continue to be involved in evaluating and validating nexthops. And this statement in section 9 of the document is a real problem:

"The expectation is that ZEBRA will handle the ZEBRA_NHG_DELETE operation, by removing the corresponding nexthop from the dataplane (which in its turn, will remove the dependent routes from the dataplane)."

I do not want zebra to take control of bgp's objects in this way, deleting them implicitly. I understand the temptation to compress the bgp->zebra communication, but that is a temptation that should be resisted in this case. For example, this would represent a tremendous work-magnification issue, where a single zapi message from bgp would trigger potentially thousands of events in zebra. We have enough trouble balancing work-rates in the system as it is now.

In the sonic context, we're talking about ways to try to improve the responsiveness of the system to certain events, like the loss of an ecmp path's nexthop. Some of that response may mean additional processing in zebra, but that will need to be carefully planned and coded so that we identify and address rate mismatches and work amplification.

mjstapp avatar Apr 25 '24 16:04 mjstapp

Thanks for the feedback Mark.

> So first, thanks for providing a detailed description of what you're trying to do - that doesn't happen often enough, really, and it's very helpful. At one level, it'd be fine with me if bgp wants to manage its own nhgs; that'd be ok if that kind of code change could be done effectively and correctly.
>
> That said, I do not want some of what I've read there to happen. BGP does not hold the RIB; zebra does, and so zebra must continue to be involved in evaluating and validating nexthops. And this statement in section 9 of the document is a real problem:
>
> "The expectation is that ZEBRA will handle the ZEBRA_NHG_DELETE operation, by removing the corresponding nexthop from the dataplane (which in its turn, will remove the dependent routes from the dataplane)."
>
> I do not want zebra to take control of bgp's objects in this way, deleting them implicitly. I understand the temptation to compress the bgp->zebra communication, but that is a temptation that should be resisted in this case. For example, this would represent a tremendous work-magnification issue, where a single zapi message from bgp would trigger potentially thousands of events in zebra. We have enough trouble balancing work-rates in the system as it is now.

I agree I need to evaluate the impact on ZEBRA performance at a large scale. I will share my results.

Regarding the functionality, I did some testing to ensure that a BGP NHG_ID is only used by BGP, and not, for instance, by a static route. This positive result makes me think that if BGP wants to remove its NHG_ID, only BGP routes will be impacted.

https://gist.github.com/pguibert6WIND/ad041717cb42207ff4b4ad0e61acf238

> In the sonic context, we're talking about ways to try to improve the responsiveness of the system to certain events, like the loss of an ecmp path's nexthop. Some of that response may mean additional processing in zebra, but that will need to be carefully planned and coded so that we identify and address rate mismatches and work amplification.

pguibert6WIND avatar Apr 26 '24 07:04 pguibert6WIND

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions[bot] avatar Apr 26 '24 18:04 github-actions[bot]

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions[bot] avatar May 15 '24 10:05 github-actions[bot]

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions[bot] avatar May 16 '24 15:05 github-actions[bot]

> This pull request has conflicts, please resolve those before we can evaluate the pull request.

Let's first look at NHG recursivity in https://github.com/FRRouting/frr/pull/16028 before rebasing.

pguibert6WIND avatar May 17 '24 10:05 pguibert6WIND

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions[bot] avatar May 28 '24 14:05 github-actions[bot]

During the tests with a full routing table, a CPU issue appeared on a multihomed setup with BGP peering over a primary and a backup nexthop. The primary root cause was that the build used 'asan' compilation, which consumed too much memory and caused heavy swapping. Still, an optimisation may be welcome to handle the high CPU usage: when the failover happens, the shell and vtysh are hardly reachable. The attached graph shows that handling the dataplane results takes a lot of time. For each route node installed or uninstalled, NHT is called to evaluate other routes. It appears that if a default route is present, that prefix is systematically re-evaluated as many times as there are routes installed (whereas only one NHT event is necessary).

[graph1: attached graph of dataplane result handling time]

The optimisation is related to the handling of NHG events in zebra. Extra code has been added there, and this extra code parses the list of route nodes, which takes time. Beyond adding that code, this pull request also optimises the way zebra handles NHG events.
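As a hedged sketch of that kind of optimisation (illustrative names only, not the code of this PR), the idea is to mark a tracked prefix dirty while processing a batch of dataplane results and evaluate it once at the end, rather than once per installed route:

#include <stdbool.h>

struct rnh_entry {
        bool dirty; /* re-evaluation requested during this batch */
        /* ... tracked prefix, resolving route, client list ... */
};

/* Called per route node installed/uninstalled within a result batch:
 * only flag the tracked entry instead of evaluating it immediately. */
static void nht_mark_dirty(struct rnh_entry *rnh)
{
        rnh->dirty = true;
}

/* Called once after the whole batch: the tracked prefix (e.g. the
 * default route) is evaluated a single time, however many routes
 * referenced it during the batch. */
static void nht_flush_dirty(struct rnh_entry *rnh)
{
        if (!rnh->dirty)
                return;
        rnh->dirty = false;
        /* evaluate_rnh(rnh); resolve and notify clients once */
}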

pguibert6WIND avatar May 31 '24 07:05 pguibert6WIND

Hi @mjstapp

> I'm very concerned about some of what's here, as I discussed in a comment earlier. I think it would be more successful to split this work into multiple PRs:

> 1. make it possible for zebra to resolve recursive nexthop groups owned by daemons (as it resolves implicit groups that it creates and owns)

https://github.com/FRRouting/frr/pull/16028

> 2. enable BGP to create and use explicit nhgs for its routes. the policy implications will have to be carefully presented. NHT events trigger BGP to refresh its groups.

The NHT information is presented in https://github.com/FRRouting/frr/pull/15488/commits/1996c17b297f81b4a88f9bdb4d2d537e28a57193#diff-e8f99c58bc731e761209c2a21ee19d4a004718048d9517342b3875634ea0cd13R105, unless you want me to present something more.

> 3. add your "zebra manages bgp's routes" proposal. as I've said, I think this is a mistake, but I think the discussion would be better if this aspect were separate from the other, more mechanical aspects.

A piece of zebra code has been added in this pull request:

https://github.com/FRRouting/frr/pull/15488/commits/1e64cc648c7453c8c89e4aa0f60c4e4a151da865

pguibert6WIND avatar May 31 '24 07:05 pguibert6WIND

ci:rerun ospf6 issue?

pguibert6WIND avatar Jun 03 '24 08:06 pguibert6WIND

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions[bot] avatar Jun 18 '24 16:06 github-actions[bot]

This pull request 'reuses' the same kind of hierarchy that exists at the zebra level with the nhg_nexthop_group structures, but at the BGP level with the bgp_nhg_cache structure. This comment illustrates what is proposed for the naming conventions:

currently:

struct bgp_nhg_nexthop_cache {
        uint16_t nexthop_num;                        /* number of valid nexthops */
        struct zapi_nexthop nexthops[MULTIPATH_NUM];
};

struct bgp_nhg_group_cache {
        uint16_t group_num;                          /* number of child group IDs */
        uint32_t groups[MULTIPATH_NUM];              /* IDs of the child NHGs */
};

struct bgp_nhg_cache {
#define BGP_NHG_FLAG_TYPE_GROUP      (1 << 3)
        uint16_t flags;

        /* an entry is either a set of nexthops or a group of groups */
        union {
                struct bgp_nhg_nexthop_cache nexthops;
                struct bgp_nhg_group_cache groups;
        };
        struct bgp_nhg_connected_tree_head nhg_child_groups, nhg_parent_groups;
};

pguibert6WIND avatar Jun 25 '24 11:06 pguibert6WIND

New proposal, renaming the connected-tree members to match zebra's nhg_depends / nhg_dependents convention:

struct bgp_nhg_nexthop_cache {
        uint16_t nexthop_num;
        struct zapi_nexthop nexthops[MULTIPATH_NUM];
};

struct bgp_nhg_group_cache {
        uint16_t group_num;
        uint32_t groups[MULTIPATH_NUM];
};

struct bgp_nhg_cache {
#define BGP_NHG_FLAG_TYPE_GROUP      (1 << 3)
        uint16_t flags;

        union {
                struct bgp_nhg_nexthop_cache nexthops;
                struct bgp_nhg_group_cache groups;
        };
        /* depends: children of this entry; dependents: its parents */
        struct bgp_nhg_connected_tree_head nhg_depends, nhg_dependents;
};
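As a usage illustration (hedged: the helper below and the ID variable are hypothetical, not taken from the PR), a parent entry with BGP_NHG_FLAG_TYPE_GROUP set stores child NHG IDs in groups[], while nhg_depends/nhg_dependents link parent and children both ways, mirroring zebra's own depends/dependents naming:

/* Hypothetical construction of a two-level hierarchy with the
 * proposed structures; nothing here is the actual PR code. */
static void bgp_nhg_link_example(struct bgp_nhg_cache *parent,
                                 struct bgp_nhg_cache *child,
                                 uint32_t child_id)
{
        /* the child is a plain nexthop entry */
        child->flags &= ~BGP_NHG_FLAG_TYPE_GROUP;

        /* the parent is a group of groups: it stores child NHG IDs */
        parent->flags |= BGP_NHG_FLAG_TYPE_GROUP;
        parent->groups.groups[parent->groups.group_num++] = child_id;

        /* nhg_depends on the parent would list the child entry, and
         * nhg_dependents on the child would point back to the parent,
         * as zebra does with its own NHG hash entries. */
}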

pguibert6WIND avatar Jun 25 '24 16:06 pguibert6WIND