
bgpd: add input per-peer priority pkt queue

pguibert6WIND opened this issue on Aug 05 '24 · 8 comments

On a heavily loaded system with many connections and many BGP updates received, the system struggles to maintain BGP connectivity with its peers.

For instance, a Wireshark capture shows a BGP OPEN packet being received but only processed by the device many seconds later. By then, the remote peer has obviously already flushed the connection.

The proposed fix creates a second stream of buffers reserved for priority BGP packets: OPEN, NOTIFICATION, and KEEPALIVE. When the bgp_packet_receive() job runs, this priority FIFO is dequeued before the standard FIFO queue is consumed.
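For illustration, here is a minimal sketch of the dequeue order described above. All names (struct peer_sketch, ibuf_priority, fifo_pop, and so on) are hypothetical placeholders, not the actual FRR structures or APIs:

```c
#include <stddef.h>

/* Illustrative only: hypothetical types, not FRR's actual symbols. */
struct pkt {
	struct pkt *next;
	unsigned char type; /* BGP message type */
};

struct pkt_fifo {
	struct pkt *head, *tail;
};

struct peer_sketch {
	struct pkt_fifo ibuf_priority; /* OPEN, NOTIFICATION, KEEPALIVE */
	struct pkt_fifo ibuf;          /* everything else, mostly UPDATE */
};

static struct pkt *fifo_pop(struct pkt_fifo *f)
{
	struct pkt *p = f->head;

	if (p) {
		f->head = p->next;
		if (!f->head)
			f->tail = NULL;
	}
	return p;
}

/* Placeholder for the real per-packet parser/handler. */
static void process_one_packet(struct peer_sketch *peer, struct pkt *p)
{
	(void)peer;
	(void)p;
}

/* The receive job: the priority FIFO is drained before the standard FIFO. */
static void peer_process_packets(struct peer_sketch *peer)
{
	struct pkt *p;

	while ((p = fifo_pop(&peer->ibuf_priority)) != NULL)
		process_one_packet(peer, p);

	while ((p = fifo_pop(&peer->ibuf)) != NULL)
		process_one_packet(peer, p);
}
```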

pguibert6WIND · Aug 05 '24 16:08

I have reservations about this: if a system is so loaded that it is running seconds behind, as in the description, should it be prioritizing more connections and more work? What good would come from that?

mjstapp · Aug 05 '24 17:08

I'm pretty OK with making such a change, although I really don't understand how what @pguibert6WIND has done here will actually solve the problem being stated. I do not see any code where we prioritize the special packet types so they are handled immediately vs. handling other peers already knee-deep in reading in full tables. Can I be pointed at this decision? Additionally, if a peer is already up, in bgp_io.c we reset the keepalive timer for any packet we receive, so I think prioritizing keepalive packets makes no sense there. All peers eventually end up adding a connect->t_process_packet event, which is per peer, and that is why I don't think this will work.

donaldsharp · Aug 05 '24 17:08

> I'm pretty OK with making such a change, although I really don't understand how what @pguibert6WIND has done here will actually solve the problem being stated. I do not see any code where we prioritize the special packet types so they are handled immediately vs. handling other peers already knee-deep in reading in full tables. Can I be pointed at this decision? Additionally, if a peer is already up, in bgp_io.c we reset the keepalive timer for any packet we receive, so I think prioritizing keepalive packets makes no sense there. All peers eventually end up adding a connect->t_process_packet event, which is per peer, and that is why I don't think this will work.

Typo: the main issue is related to BGP OPEN messages that are processed too late. That said, the same problem should happen with KEEPALIVE messages if the user configures lower keepalive values.

pguibert6WIND · Aug 05 '24 19:08

> I have reservations about this: if a system is so loaded that it is running seconds behind, as in the description, should it be prioritizing more connections and more work? What good would come from that?

The scenario is a route reflector (RR) with 998 listening peers and 2 speaking peers advertising 4M L3VPN prefixes.

pguibert6WIND · Aug 05 '24 19:08

You don't explain how this helps, given the way the system works. If I have 998 peers, that means I possibly have 998 events ahead of the new peer's event, and each of those events can already process a whole bunch of packets. Those events are what prevent us from getting to the problem you are describing. Remember, we have an event per peer to handle its I/O; adding a priority queue for a peer doesn't help when I have to process the events from the other peers first (as their input queues could be filled). To solve the problem, from my perspective, you would need all priority events to be handled before any normal packet processing can occur. This implies some sort of equivalent to zebra's MetaQ handling. Finally, your answer about keepalives makes no sense to me either: FRR currently counts any received packet as triggering the keepalive already, and since we made that change I have not seen keepalive problems at all.
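For context, a rough sketch of the kind of global ordering being suggested here, in the spirit of zebra's MetaQ: one scheduler drains all peers' priority work before any normal packet processing runs. The names (metaq, WQ_SESSION, struct work) are hypothetical, not existing FRR symbols:

```c
#include <stddef.h>

/* Sub-queue 0 (session messages from any peer) is always drained
 * before sub-queue 1 (regular UPDATE processing). */
enum work_prio { WQ_SESSION = 0, WQ_UPDATES = 1, WQ_NUM };

struct work {
	struct work *next;
	void (*run)(struct work *w);
};

struct metaq {
	struct work *head[WQ_NUM];
	struct work *tail[WQ_NUM];
};

static void metaq_enqueue(struct metaq *mq, enum work_prio prio, struct work *w)
{
	w->next = NULL;
	if (mq->tail[prio])
		mq->tail[prio]->next = w;
	else
		mq->head[prio] = w;
	mq->tail[prio] = w;
}

/* One scheduling pass: strictly higher-priority sub-queues first,
 * regardless of which peer the work item belongs to. */
static void metaq_run(struct metaq *mq)
{
	for (int prio = 0; prio < WQ_NUM; prio++) {
		struct work *w;

		while ((w = mq->head[prio]) != NULL) {
			mq->head[prio] = w->next;
			if (!mq->head[prio])
				mq->tail[prio] = NULL;
			w->run(w);
		}
	}
}
```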

Is this code you have had internally for a long time and are now just forward porting?

donaldsharp · Aug 05 '24 20:08

> You don't explain how this helps, given the way the system works. If I have 998 peers, that means I possibly have 998 events ahead of the new peer's event, and each of those events can already process a whole bunch of packets. Those events are what prevent us from getting to the problem you are describing. Remember, we have an event per peer to handle its I/O; adding a priority queue for a peer doesn't help when I have to process the events from the other peers first (as their input queues could be filled). To solve the problem, from my perspective, you would need all priority events to be handled before any normal packet processing can occur. This implies some sort of equivalent to zebra's MetaQ handling. Finally, your answer about keepalives makes no sense to me either: FRR currently counts any received packet as triggering the keepalive already, and since we made that change I have not seen keepalive problems at all.
>
> Is this code you have had internally for a long time and are now just forward porting?

This code is a trial from a few weeks ago. Its impact during testing was not obvious, and we did not keep it then. The intent here was to share the code for further review.

You are totally right about the per-peer queue case. Thanks for pointing this out.

  • Speeding up the processing of a given peer will not speed up the processing of another peer. => I need to think of an alternative change.

About the BGP keepalive, I think this can still happen:

  • If we are delayed in processing BGP UPDATEs for a given peer and we also have BGP KEEPALIVEs queued, the KEEPALIVE packets will only be processed after the UPDATEs, and the delay may reach several seconds. AFAIK, the keepalive thread only transmits BGP keepalives; it does not receive them.
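For reference, a minimal sketch (hypothetical names, not the actual bgp_io.c code) of the read-side behaviour described earlier in the thread, where any received packet refreshes the peer's liveness at read time, so a KEEPALIVE that is parsed late does not expire the session:

```c
#include <time.h>

/* Hypothetical per-peer I/O state. */
struct peer_io_sketch {
	time_t holdtime;      /* negotiated hold time, in seconds */
	time_t last_activity; /* updated in the read loop */
};

/* Called from the I/O loop for every fully read packet, before the
 * packet is queued for the packet-processing job. */
static void io_mark_activity(struct peer_io_sketch *peer)
{
	peer->last_activity = time(NULL);
}

/* Called by the hold-timer check; a late-processed KEEPALIVE does not
 * matter because last_activity already moved when it was read. */
static int io_holdtime_expired(const struct peer_io_sketch *peer)
{
	return time(NULL) - peer->last_activity > peer->holdtime;
}
```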

pguibert6WIND · Aug 06 '24 10:08

Moving it to draft to decide on the future of this commit:

  • either propose a better solution to anticipate BGP OPEN message handling (it has become clear this change is not significant for OPEN messages),
  • or focus only on BGP KEEPALIVE (and NOTIFICATION) messages.

pguibert6WIND · Aug 06 '24 10:08

> I have reservations about this: if a system is so loaded that it is running seconds behind, as in the description, should it be prioritizing more connections and more work? What good would come from that?

I agree with @mjstapp here. I feel a better approach would be to prioritize the keepalive messages instead, so that existing sessions do not flap due to a high volume of updates.

pushpasis · Aug 08 '24 07:08