sonic-buildimage icon indicating copy to clipboard operation
sonic-buildimage copied to clipboard

[T2][202405] Orchagent crashes when running `acl/test_acl.py` sonic-mgmt test

Open arista-nwolfe opened this issue 4 months ago • 10 comments

When running acl/test_acl.py, specifically TestAclWithPortToggle, we've seen orchagent crashes due to attempts to delete nexthops that don't exist: Backtrace:

#0  0x00007f7e66107e3c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f7e660b8fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f7e660a3472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x0000556e9607ce9d in handleSaiFailure (abort_on_failure=abort_on_failure@entry=true) at ./orchagent/saihelper.cpp:834
#4  0x0000556e9631afc5 in handleSaiRemoveStatus (api=api@entry=SAI_API_NEXT_HOP_GROUP, status=<optimized out>, context=context@entry=0x0)
    at ./orchagent/saihelper.cpp:733
#5  0x0000556e961a6fe7 in RouteOrch::removeNextHopGroup (this=this@entry=0x556e9835c430, nexthops=...) at ./orchagent/routeorch.cpp:1470
#6  0x0000556e961afa1b in RouteOrch::doTask (this=0x556e9835c430, consumer=...) at ./orchagent/routeorch.cpp:1038
#7  0x0000556e9617441d in Consumer::drain (this=0x556e9835dca0) at ./orchagent/orch.cpp:258
#8  Consumer::drain (this=0x556e9835dca0) at ./orchagent/orch.cpp:255
#9  Consumer::execute (this=0x556e9835dca0) at ./orchagent/orch.cpp:252
#10 0x0000556e9616034a in OrchDaemon::start (this=this@entry=0x556e9829a480) at ./orchagent/orchdaemon.cpp:873
#11 0x0000556e960cdd59 in main (argc=<optimized out>, argv=<optimized out>) at ./orchagent/main.cpp:810

Syslog:

2024 Oct 23 18:11:23.271424 cmp206-5 NOTICE swss0#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.93@Ethernet-IB0,10.0.0.99@Ethernet-IB0,10.0.0.101@Ethernet-IB0,10.0.0.103@Ethernet-IB0,10.0.0.105@Ethernet-IB0,10.0.0.107@Ethernet-IB0,10.0.0.109@Ethernet-IB0,10.0.0.111@Ethernet-IB0,10.0.0.113@Ethernet-IB0,10.0.0.115@Ethernet-IB0,10.0.0.117@Ethernet-IB0,10.0.0.119@Ethernet-IB0,10.0.0.121@Ethernet-IB0,10.0.0.123@Ethernet-IB0,10.0.0.125@Ethernet-IB0,10.0.0.127@Ethernet-IB0,10.0.0.167@Ethernet-IB0,10.0.0.169@Ethernet-IB0,10.0.0.171@Ethernet-IB0,10.0.0.173@Ethernet-IB0,10.0.0.175@Ethernet-IB0,10.0.0.177@Ethernet-IB0,10.0.0.179@Ethernet-IB0,10.0.0.181@Ethernet-IB0,10.0.0.183@Ethernet-IB0,10.0.0.185@Ethernet-IB0,10.0.0.187@Ethernet-IB0,10.0.0.189@Ethernet-IB0,10.0.0.191@Ethernet-IB0

2024 Oct 23 18:11:23.271424 cmp206-5 ERR swss0#orchagent: :- meta_sai_validate_oid: object key SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x2d000000001ee7 doesn't exist
2024 Oct 23 18:11:23.271424 cmp206-5 ERR swss0#orchagent: :- flush_removing_entries: ObjectBulker.flush remove entries failed, number of entries to remove: 29, status: SAI_STATUS_ITEM_NOT_FOUND
2024 Oct 23 18:11:23.271448 cmp206-5 ERR swss0#orchagent: :- removeNextHopGroup: Failed to remove next hop group member[0] 2d0000000020fd, rv:-23
2024 Oct 23 18:11:23.271448 cmp206-5 ERR swss0#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP_GROUP, status: SAI_STATUS_NOT_EXECUTED

When I look for that object Id oid:0x2d000000001ee7 in the sairedis logs I see that it was created and deleted ~20s after each other

2024-10-23.18:08:35.276522|C|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER||oid:0x2d000000001ed1|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID=oid:0x5000000001ed0|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_ID=oid:0x4000000000c27||…||oid:0x2d000000001ee7|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID=oid:0x5000000001ed0|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_ID=oid:0x400000000100c||…||

2024-10-23.18:08:55.274258|r|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x2d000000001ee7

So this justifies the SAI error (the entry isn't in ASIC_DB), but I can't seem to find any obvious reason for the nexthop being removed at 18:08:55 in the syslog. The only thing note-worthy is that TestAclWithPortToggle triggered the shutdown of all ports at 18:08:08:

2024 Oct 23 18:08:08.458143 cmp206-5 INFO python[23588]: ansible-shell_cmds Invoked with cmds=['config interface -n asic0 shutdown Ethernet0', 'config interface -n asic0 shutdown Ethernet8', ...

And I see this nexthop brought down around the same time, but it's not the same as the next hop listed at the crash:

2024 Oct 23 18:08:53.408694 cmp206-5 NOTICE swss0#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.1@Ethernet-IB0,10.0.0.5@Ethernet-IB0,10.0.0.9@Ethernet-IB0,10.0.0.13@Ethernet-IB0,10.0.0.17@Ethernet-IB0,10.0.0.21@Ethernet-IB0,10.0.0.25@Ethernet-IB0,10.0.0.29@Ethernet-IB0,10.0.0.33@Ethernet-IB0,10.0.0.35@Ethernet-IB0,10.0.0.37@Ethernet-IB0,10.0.0.39@Ethernet-IB0,10.0.0.41@Ethernet-IB0,10.0.0.43@Ethernet-IB0,10.0.0.47@Ethernet-IB0,10.0.0.49@Ethernet-IB0,10.0.0.51@Ethernet-IB0,10.0.0.53@Ethernet-IB0,10.0.0.55@Ethernet-IB0,10.0.0.57@Ethernet-IB0,10.0.0.59@Ethernet-IB0,10.0.0.61@Ethernet-IB0,10.0.0.63@Ethernet-IB0
2024 Oct 23 18:08:53.408880 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.1@Ethernet-IB0 with next_hop_id 2d0000000010ac
2024 Oct 23 18:08:53.409005 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.5@Ethernet-IB0 with next_hop_id 2d0000000010ad
2024 Oct 23 18:08:53.409125 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.9@Ethernet-IB0 with next_hop_id 2d0000000010ae
2024 Oct 23 18:08:53.409243 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.13@Ethernet-IB0 with next_hop_id 2d0000000010af
2024 Oct 23 18:08:53.409361 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.17@Ethernet-IB0 with next_hop_id 2d0000000010b0
2024 Oct 23 18:08:53.409478 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.21@Ethernet-IB0 with next_hop_id 2d0000000010b1
2024 Oct 23 18:08:53.409593 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.25@Ethernet-IB0 with next_hop_id 2d0000000010b2
2024 Oct 23 18:08:53.409730 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.29@Ethernet-IB0 with next_hop_id 2d0000000010b3
2024 Oct 23 18:08:53.409847 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.33@Ethernet-IB0 with next_hop_id 2d0000000010b4
2024 Oct 23 18:08:53.409962 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.35@Ethernet-IB0 with next_hop_id 2d0000000010b5
2024 Oct 23 18:08:53.410096 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.37@Ethernet-IB0 with next_hop_id 2d0000000010b6
2024 Oct 23 18:08:53.410213 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.39@Ethernet-IB0 with next_hop_id 2d0000000010b7
2024 Oct 23 18:08:53.410331 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.41@Ethernet-IB0 with next_hop_id 2d0000000010b8
2024 Oct 23 18:08:53.410448 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.43@Ethernet-IB0 with next_hop_id 2d0000000010b9
2024 Oct 23 18:08:53.410563 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.47@Ethernet-IB0 with next_hop_id 2d0000000010ba
2024 Oct 23 18:08:53.410678 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.49@Ethernet-IB0 with next_hop_id 2d0000000010bb
2024 Oct 23 18:08:53.410793 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.51@Ethernet-IB0 with next_hop_id 2d0000000010bc
2024 Oct 23 18:08:53.410907 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.53@Ethernet-IB0 with next_hop_id 2d0000000010bd
2024 Oct 23 18:08:53.411024 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.55@Ethernet-IB0 with next_hop_id 2d0000000010be
2024 Oct 23 18:08:53.411140 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.57@Ethernet-IB0 with next_hop_id 2d0000000010bf
2024 Oct 23 18:08:53.411257 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.59@Ethernet-IB0 with next_hop_id 2d0000000010c0
2024 Oct 23 18:08:53.411375 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.61@Ethernet-IB0 with next_hop_id 2d0000000010c1
2024 Oct 23 18:08:53.411491 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.63@Ethernet-IB0 with next_hop_id 2d0000000010c2

Steps to reproduce the issue:

  1. run acl/test_acl.py::TestAclWithPortToggle on a T2 system (Not 100% reproducible, may take a few tries)

NOTE: This is also seen on t2-min topologies

arista-nwolfe avatar Oct 24 '24 21:10 arista-nwolfe