sonic-swss

Arp Refresh changes

Open sumanbrcm opened this issue 4 years ago • 1 comments

What I did

This change reworks ARP refresh; details are provided in the sections below.

Why I did it

SONiC depends upon the Linux kernel to manage the ARP/ND tables. SONiC then listens to ARP/ND events from the kernel and synchronizes the hardware as required. However, there are a number of problems with this:

  1. The kernel does not "see" the through-traffic that is routed in HW, and so cannot update its "hit bits" accordingly. Therefore the kernel may age out an entry that is still in use.
  2. The kernel also does not "see" the HW MAC aging process, so it does not know when a MAC address associated with an ARP/ND entry has been aged out, and does not refresh it. This can result in traffic black holes for a "quiet" neighbor (i.e. one that does not transmit much).
  3. There is a further problem in MCLAG/ICCP setups whereby the response to an ARP/ND request initiated by the kernel on one peer can go to the other peer. The response eventually makes its way back across the ICCP control plane, but by then the kernel may have already aged out the entry.
  4. The current ARP Refresh process is implemented as a bash script and cannot run fast enough to be effective at scale, requiring the network operator to set much higher aging timers than would otherwise be used. It is also a very inefficient use of system resources.

So, the proposal here is to design and implement a much faster and more efficient instance of the ARP Refresh process.

How I verified it

  1. Verification was done by adding ARP entries dynamically and using tcpdump to check that ARP requests/replies are observed in accordance with the proposed design. Detailed test logs follow.

a. For ARP (3 refreshes for 12.12.12.2 are shown in the logs below; other ARP entries and further logs are omitted here)

admin@sonic:~$ show arp
Address      MacAddress         Iface      Vlan
10.59.128.1  00:00:0c:9f:f4:68  eth0       -
12.12.12.2   00:10:94:00:00:05  Ethernet0  -
12.12.12.3   00:10:94:00:00:06  Ethernet0  -
12.12.12.4   00:10:94:00:00:07  Ethernet0  -
12.12.12.5   00:10:94:00:00:08  Ethernet0  -
Total number of entries 5

admin@sonic:~$ sudo tcpdump -ei Ethernet0
19:27:40.364309 3c:2c:99:2d:84:35 (oui Unknown) > 00:10:94:00:00:05 (oui Unknown), ethertype ARP (0x0806), length 42: Request who-has 12.12.12.2 tell 12.12.12.1, length 28
19:27:40.364666 00:10:94:00:00:05 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype ARP (0x0806), length 60: Reply 12.12.12.2 is-at 00:10:94:00:00:05 (oui Unknown), length 46

19:32:40.397044 3c:2c:99:2d:84:35 (oui Unknown) > 00:10:94:00:00:05 (oui Unknown), ethertype ARP (0x0806), length 42: Request who-has 12.12.12.2 tell 12.12.12.1, length 28
19:32:40.397380 00:10:94:00:00:05 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype ARP (0x0806), length 60: Reply 12.12.12.2 is-at 00:10:94:00:00:05 (oui Unknown), length 46

19:37:40.428211 3c:2c:99:2d:84:35 (oui Unknown) > 00:10:94:00:00:05 (oui Unknown), ethertype ARP (0x0806), length 42: Request who-has 12.12.12.2 tell 12.12.12.1, length 28
19:37:40.428622 00:10:94:00:00:05 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype ARP (0x0806), length 60: Reply 12.12.12.2 is-at 00:10:94:00:00:05 (oui Unknown), length 46

admin@sonic:~$ sudo tcpdump -ei Ethernet0

b. For NDP (3 refreshes for 2100::2 are shown in the logs below; other NDP entries and further logs are omitted here)

admin@sonic:~$ show ndp | head
Address                    MacAddress         Iface      Vlan  Status
2100::2                    00:10:94:00:00:09  Ethernet0  -     REACHABLE
2100::3                    00:10:94:00:00:0a  Ethernet0  -     REACHABLE
2100::4                    00:10:94:00:00:0b  Ethernet0  -     REACHABLE
2100::5                    00:10:94:00:00:0c  Ethernet0  -     REACHABLE
fe80::1a5a:58ff:fe17:c2e0  18:5a:58:17:c2:e0  eth0       -     STALE
fe80::1a5a:58ff:fe18:f720  18:5a:58:18:f7:20  eth0       -     STALE
fe80::1a5a:58ff:fe19:620   18:5a:58:19:06:20  eth0       -     STALE
fe80::3e2c:99ff:fe2d:8735  3c:2c:99:2d:87:35  eth0       -     STALE

11:55:46.283420 3c:2c:99:2d:84:35 (oui Unknown) > 33:33:ff:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 86: fe80::3e2c:99ff:fe2d:8435 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2100::2, length 32
11:55:46.283763 00:10:94:00:00:09 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype IPv6 (0x86dd), length 86: 2100::2 > fe80::3e2c:99ff:fe2d:8435: ICMP6, neighbor advertisement, tgt is 2100::2, length 32

12:00:46.314416 3c:2c:99:2d:84:35 (oui Unknown) > 33:33:ff:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 86: fe80::3e2c:99ff:fe2d:8435 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2100::2, length 32
12:00:46.314820 00:10:94:00:00:09 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype IPv6 (0x86dd), length 86: 2100::2 > fe80::3e2c:99ff:fe2d:8435: ICMP6, neighbor advertisement, tgt is 2100::2, length 32

12:06:46.350847 3c:2c:99:2d:84:35 (oui Unknown) > 33:33:ff:00:00:02 (oui Unknown), ethertype IPv6 (0x86dd), length 86: fe80::3e2c:99ff:fe2d:8435 > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2100::2, length 32
12:06:46.351333 00:10:94:00:00:09 (oui Unknown) > 3c:2c:99:2d:84:35 (oui Unknown), ethertype IPv6 (0x86dd), length 86: 2100::2 > fe80::3e2c:99ff:fe2d:8435: ICMP6, neighbor advertisement, tgt is 2100::2, length 32

  2. More compliance test results will be added later.

Details if related

ARP Refresh Thread:

ARP refresh functionality is added to neighsyncd process.

Neighsyncd is responsible for syncing the kernel ARP table to the hardware via the APP_DB and OrchAgents. Neighsyncd listens on netlink events (RTM_NEWNEIGH, RTM_DELNEIGH) and creates/deletes NEIGH_TABLE entries in APP_DB.

Existing functionality of neighsyncd is retained as it is. In addition to managing NEIGH_TABLE entries in APP_DB, neighsyncd will also add the details of the neighbor into a queue towards the new ARP Refresh thread described below.

A new ARP refresh thread is created in neighsyncd:
- to dequeue the neighbor events and populate a neighbor cache.
- to periodically refresh ARP/ND by sending an ARP request pkt / NS pkt.
- to subscribe to redis-db to gather the data required to send the ARP refresh packets.
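The hand-off between the existing netlink handling and the new thread can be pictured with a small producer/consumer sketch. This is an illustration only, written in Python rather than the C++ used by neighsyncd; the event tuple layout, the `STOP` sentinel, and the function names are hypothetical and not taken from the actual change.

```python
import queue
import threading

# Hypothetical event tuples: (event_type, ip, ifname, mac)
NEW, DEL, STOP = "RTM_NEWNEIGH", "RTM_DELNEIGH", "STOP"

def refresh_thread(events, cache):
    """Drain neighbor events from the queue into the neighbor cache."""
    while True:
        kind, ip, ifname, mac = events.get()
        if kind == STOP:           # sentinel used only to end this demo
            break
        if kind == NEW:
            cache[(ip, ifname)] = mac
        elif kind == DEL:
            cache.pop((ip, ifname), None)

def run_demo():
    events = queue.Queue()
    cache = {}
    worker = threading.Thread(target=refresh_thread, args=(events, cache))
    worker.start()
    # The netlink-handling side would enqueue these as kernel events arrive.
    events.put((NEW, "12.12.12.2", "Ethernet0", "00:10:94:00:00:05"))
    events.put((NEW, "12.12.12.3", "Ethernet0", "00:10:94:00:00:06"))
    events.put((DEL, "12.12.12.3", "Ethernet0", None))
    events.put((STOP, None, None, None))
    worker.join()
    return cache
```

Because only the refresh thread touches the cache in this sketch, the queue is the single synchronization point between the two sides.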

Following are the different modules in the ARP refresh thread.

Neighbor Cache Management
- Add neighbor entries to cache when the entry is learned from the kernel
  - All dynamically learned neighbor entries [ARP, ND (Global, LinkLocal)]
  - All static neighbor entries (MAC can be dynamic)
- Below entries will not be added to the neighbor cache
  - Neighbors learned from the "eth0" interface
  - Neighbors learned from BGP/EVPN MAC/IP type-2 routes
  - MYIPaddress entries, FF:FF:FF:FF:FF:FF entries, permanent entries
- Remove entries from cache when the entry is deleted from the kernel
- v4/v6 Neighbor Cache [map] contents are:
  - Key = IP Address + InterfaceName [Phy/PortChannel/Vlan/Sag]
  - Value
    - MAC Address
    - State (Reachable/Failed)
    - Timestamp (Entry creation/last refresh)
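A minimal Python sketch of the cache layout and the admission rules described above. The field names, the flag parameters (`permanent`, `from_evpn`, `is_own_ip`), and the reading of the exclusion list as separate own-IP, broadcast-MAC, and permanent cases are assumptions made for illustration; the real implementation lives in neighsyncd and may differ.

```python
import time
from dataclasses import dataclass, field

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

@dataclass
class NeighborEntry:
    mac: str
    state: str                      # "Reachable" or "Failed"
    timestamp: float = field(default_factory=time.monotonic)

def is_cacheable(ifname, mac, permanent=False, from_evpn=False, is_own_ip=False):
    """Apply the exclusion rules; the flag names are hypothetical."""
    if ifname == "eth0":            # management interface is skipped
        return False
    if permanent or from_evpn or is_own_ip:
        return False
    if mac is not None and mac.lower() == BROADCAST_MAC:
        return False
    return True

def add_neighbor(cache, ip, ifname, mac, state="Reachable", **flags):
    """Insert or update an entry keyed by (IP address, interface name)."""
    if not is_cacheable(ifname, mac, **flags):
        return False
    cache[(ip, ifname)] = NeighborEntry(mac=mac, state=state)
    return True

def del_neighbor(cache, ip, ifname):
    """Drop the entry when the kernel deletes the neighbor."""
    cache.pop((ip, ifname), None)
```

Keying on the (IP, interface) pair rather than on the IP alone keeps entries distinct when the same address appears on more than one interface.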

Interface Cache Management
- Required for framing the ARP packets we send
- Interface Cache [Map]
  - Key = Interface name
  - Value = IP, MAC, Ifname to Index
- Subscribe to redis-db tables
  - IP address
    - CONFIG_DB: INTERFACE, VLAN_INTERFACE, SAG_INTERFACE
  - MAC
    - CONFIG_DB: DEVICE_METADATA ==> System MAC
    - CONFIG_DB: SAG_GLOBAL
- Ifname to Index (required for socket send)

Packet Builder
- Based on Neighbor Cache
  - Build ARP packet
  - Build NS packet
- For a resolved ARP Dst MAC, the ARP request is unicast
- For an unresolved ARP Dst MAC, the ARP request is broadcast
- IPv6 NS uses multicast
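To make the addressing rules concrete, here is a hedged Python sketch that frames an ARP request and derives the solicited-node multicast destination for an NS. The function names are hypothetical, and leaving the ARP target hardware address zeroed is an assumption of this sketch rather than something stated in the design. The 42-byte frame size and the `ff02::1:ff00:2` / `33:33:ff:00:00:02` destinations agree with the tcpdump captures in the verification logs.

```python
import ipaddress
import struct

BROADCAST = bytes.fromhex("ffffffffffff")

def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))

def build_arp_request(src_mac, src_ip, target_ip, target_mac=None):
    """Return an Ethernet + ARP request frame (42 bytes, no padding).

    The Ethernet destination is the neighbor MAC when it is resolved,
    otherwise the broadcast address.
    """
    smac = mac_bytes(src_mac)
    dst = mac_bytes(target_mac) if target_mac else BROADCAST
    eth = dst + smac + struct.pack("!H", 0x0806)       # ethertype ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)    # Ethernet/IPv4, opcode 1
    arp += smac + ipaddress.IPv4Address(src_ip).packed
    arp += bytes(6) + ipaddress.IPv4Address(target_ip).packed  # target MAC zeroed
    return eth + arp

def solicited_node(target_ip6):
    """Return (multicast IPv6 group, multicast MAC) used as the NS destination."""
    low24 = ipaddress.IPv6Address(target_ip6).packed[-3:]
    prefix = bytes.fromhex("ff02" + "0000" * 4 + "0001" + "ff")
    group = ipaddress.IPv6Address(prefix + low24)
    mac = b"\x33\x33" + group.packed[-4:]
    return str(group), ":".join(f"{b:02x}" for b in mac)
```

For example, `solicited_node("2100::2")` gives the same destination pair that appears in the NDP capture for neighbor 2100::2.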

Send Refresh
- Send ARP/NS packets using raw socket
- Separate sockets for ARP and ICMPv6 NS
- Send Unicast packet
- VLAN tagging & FDB lookup happens in kernel based on outgoing interface

Refresh Timer
- Traverse the neighbor cache entries periodically (every 30 secs)
- Check whether the refresh timeout has elapsed for each neighbor
- If it has elapsed, send an ARP/NS packet
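The periodic traversal can be sketched as a single sweep function that a 30-second timer would call. This Python fragment is illustrative only: the dictionary fields `last_refresh` and `timeout`, the `send` callback, and the choice to restart the clock right after sending are assumptions, not details of the actual neighsyncd code.

```python
SWEEP_INTERVAL_SECS = 30  # traversal period named in the design

def sweep(cache, now, send):
    """Send a refresh for every neighbor whose timeout has elapsed.

    cache maps (ip, ifname) -> {"last_refresh": secs, "timeout": secs}.
    send is a callback that transmits the ARP/NS packet for a key.
    Returns the list of keys that were refreshed in this pass.
    """
    refreshed = []
    for key, entry in cache.items():
        if now - entry["last_refresh"] >= entry["timeout"]:
            send(key)
            entry["last_refresh"] = now   # restart the clock for this entry
            refreshed.append(key)
    return refreshed
```

Since entries are only examined once per sweep, a refresh can go out up to one sweep interval later than its computed timeout.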

Refresh timeout Calculation:

To avoid sending all ARP/NS packets simultaneously, each neighbor entry is given a different refresh timeout value. This refresh timeout value is based on the MAC/ARP/ND aging times.

ARP Reference Timeout (ARP_RT) = Lesser of [MAC age, ARP age]
ND Reference Timeout (ND_RT) = Lesser of [MAC age, ND age]

Refresh Timeout = 30% to 70% of [ARP/ND Reference Timeout]

For example:

MAC age is less than ARP age
- ARP ageout = 60 mins
- MAC ageout = 30 mins
- Reference Timeout = 30 mins
- Refresh timeout will be between 30% and 70% of the reference timeout (9 to 21 mins).

ARP age is less than MAC age
- ARP ageout = 60 mins
- MAC ageout = 90 mins
- Reference Timeout = 60 mins
- Refresh timeout will be between 30% and 70% of the reference timeout (18 to 42 mins).

The refresh timeout is set whenever the neighbor entry is added or updated in the cache. It is also recomputed after each ARP/NS refresh packet is sent.
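The calculation can be written out as a short Python sketch. Drawing the value uniformly at random within the 30% to 70% window is an assumption made here; the design only states that each entry gets a different value in that range. The function name and the `rng` parameter are hypothetical.

```python
import random

def refresh_timeout(mac_age, neigh_age, rng=random):
    """Pick a per-neighbor refresh timeout, in the same unit as the inputs.

    The reference timeout is the lesser of the MAC age and the ARP/ND age;
    the result falls between 30% and 70% of that reference.
    """
    reference = min(mac_age, neigh_age)
    return rng.uniform(0.3 * reference, 0.7 * reference)
```

With a MAC age of 30 mins and an ARP age of 60 mins, every call returns a value between 9 and 21 mins, matching the first worked example.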

Recommended Configurations:

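| ARP scale | MAC aging timer (min) | ARP aging timer (min) |
|-----------|-----------------------|-----------------------|
| 2000      | 10 (default)          | 30 (default)          |
| 4000      | 10                    | 30 (default)          |
| 6000      | 20                    | 60                    |
| 24K       | 40                    | 60                    |
| 32K       | 60                    | 90                    |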

sumanbrcm, Dec 15 '20 19:12

This was discussed offline with BCM, and we agreed on the approach of moving the refresh mechanism to the neighbor manager. Awaiting PR update.

prsunny, Feb 10 '21 17:02