Unexpected error syslog due to negative refcnt of nexthop
Description
We noticed the following unexpected syslog in 202205 sonic-mgmt testing on T2:
E Feb 19 16:48:40.122676 cmp227-6 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000005844 with ip: 10.0.0.5 and alias: Ethernet-IB0
E
E Feb 19 16:48:51.104598 cmp227-6 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000005845 with ip: fc00::a and alias: Ethernet-IB0
The syslog was seen in voq tests, e.g., voq/test_voq_chassis_app_db_consistency.py.
Additional information you deem important (e.g. issue happens only occasionally):
Adding @arlakshm and @kenneth-arista for visibility.
In another instance of the failure, we also observed an orchagent crash along with the error syslogs:
E Feb 19 16:56:54.058695 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.828994 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8b with ip: 10.0.0.5 and alias: Ethernet-IB0
E
E Feb 19 16:57:03.846303 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8b with ip: 10.0.0.5 and alias: Ethernet-IB0
E
E Feb 19 16:57:03.858634 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8b with ip: 10.0.0.5 and alias: Ethernet-IB0
E
E Feb 19 16:57:03.869720 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.878436 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.886809 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.895243 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.903355 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.906794 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.914303 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:03.922499 cmp227-5 ERR swss#orchagent: :- decreaseNextHopRefCount: Ref count cannot be negative for next_hop_id: 0x4000000010b8c with ip: fc00::a and alias: Ethernet-IB0
E
E Feb 19 16:57:28.426127 cmp227-5 ERR syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_dnx_set_neighbor_entry_attribute:1265 L3 host find failed with error Entry not found (0xfffffff9).
E
E Feb 19 16:57:28.426127 cmp227-5 ERR syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_set_neighbor_entry_attribute:637 pd neighbor set failed with error -7.
E
E Feb 19 16:57:28.426127 cmp227-5 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_SET failed in syncd mode: SAI_STATUS_ITEM_NOT_FOUND
E
E Feb 19 16:57:28.426232 cmp227-5 ERR syncd#syncd: :- processQuadEvent: attr: SAI_NEIGHBOR_ENTRY_ATTR_ENCAP_INDEX: 1074790415
E
E Feb 19 16:57:28.426539 cmp227-5 ERR swss#orchagent: :- updateVoqNeighborEncapIndex: Failed to update voq encap index for neighbor 10.0.0.5 on cmp227-4|asic0|PortChannel999, rv:-7
E
E Feb 19 16:57:28.426566 cmp227-5 ERR swss#orchagent: :- meta_sai_validate_oid: oid is set to null object id on SAI_OBJECT_TYPE_NEXT_HOP
E
E Feb 19 16:57:28.426584 cmp227-5 ERR swss#orchagent: :- removeNeighbor: Failed to remove next hop 10.0.0.5 on cmp227-4|asic0|PortChannel999, rv:-5
E
E Feb 19 16:57:28.426584 cmp227-5 ERR swss#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP, status: SAI_STATUS_INVALID_PARAMETER
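Since the same message repeats many times per next hop, a small tally can help when comparing runs. The following is only a sketch: the regular expression is derived from the orchagent messages quoted above, and the helper name `tally_negative_refcount` is hypothetical, not part of sonic-mgmt or swss.

```python
import re
from collections import Counter

# Pattern derived from the orchagent messages quoted in this issue.
# Assumption: other images keep the same message wording.
REFCOUNT_RE = re.compile(
    r"decreaseNextHopRefCount: Ref count cannot be negative "
    r"for next_hop_id: (?P<nh_id>0x[0-9a-fA-F]+) "
    r"with ip: (?P<ip>\S+) and alias: (?P<alias>\S+)"
)


def tally_negative_refcount(lines):
    """Count negative-refcount errors per (next_hop_id, ip, alias).

    Hypothetical helper for triage; lines that do not match are ignored.
    """
    counts = Counter()
    for line in lines:
        match = REFCOUNT_RE.search(line)
        if match:
            counts[(match["nh_id"], match["ip"], match["alias"])] += 1
    return counts
```

Feeding it the syslog lines from a failed run (for example, `tally_negative_refcount(open(path))`, where `path` is wherever the log was saved) gives one counter entry per affected next hop.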
@ysmanman, can you attach the techsupport dump to the issue?
@saksarav-nokia Please help to take a look at this issue.
@arlakshm @ysmanman, we don't see this error on our Nokia chassis with a full OC run. We also ran the voq/test_voq_chassis_app_db_consistency.py tests ~75 times with 202205 and didn't see this error or the crash. We need more details to reproduce the issue.
Thanks @saksarav-nokia for the update. Reassigning this issue to @ysmanman, as this crash is seen during their testing.
We ran into the failure in a recent T2 sonic-mgmt test run. We observed that stale system neighbors were not deleted from chassis db after load-minigraph:
Mar 11 20:24:55.932512 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_NEIGH entries deleted:
Mar 11 20:24:55.946421 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_INTERFACE entries deleted: 12
Mar 11 20:24:55.960215 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: 0
Mar 11 20:24:55.974263 cmp227-4 NOTICE root: Chassis db clean up for swss0. Number of SYSTEM_LAG_TABLE entries deleted: 9
Note that the first syslog does not show how many system neighbor entries were deleted. This implies that chassis-db failed to eval the Lua script used to delete system neighbors. The issue may be similar to https://github.com/sonic-net/sonic-buildimage/issues/17945.
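To spot this symptom quickly in other runs, the clean-up messages can be scanned for a missing count. This is a rough sketch under the assumption that an empty count after "entries deleted:" always indicates the failed script evaluation described above; the pattern is taken from the four syslog lines quoted here, and `find_missing_cleanup_counts` is a hypothetical helper name.

```python
import re

# Pattern derived from the "Chassis db clean up" messages quoted above.
# The count group may be empty, which is the case we want to flag.
CLEANUP_RE = re.compile(
    r"Chassis db clean up for (?P<service>\S+)\. "
    r"Number of (?P<table>\S+) entries deleted:\s*(?P<count>\d*)\s*$"
)


def find_missing_cleanup_counts(lines):
    """Return (service, table) pairs whose clean-up message has no count."""
    missing = []
    for line in lines:
        match = CLEANUP_RE.search(line)
        if match and match["count"] == "":
            missing.append((match["service"], match["table"]))
    return missing
```

Run against the four messages above, only the SYSTEM_NEIGH line for swss0 should be reported.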
@ysmanman, it is already fixed in https://github.com/sonic-net/sonic-buildimage/pull/17962
@ysmanman, please let us know whether you were able to check if this issue is resolved with https://github.com/sonic-net/sonic-buildimage/pull/17962
Hi @judyjoseph, yes, we are in the process of verifying whether https://github.com/sonic-net/sonic-buildimage/pull/17962 fixes the issue. Given that the issue happens intermittently, we need a couple of runs to confirm whether it is fixed.
We didn't see the error syslog in recent testing.