[teammgr] Added LAG member check into addLagMember()

Open akokhan opened this issue 3 years ago • 8 comments

Signed-off-by: Andriy Kokhan [email protected]

What I did

Added a check to addLagMember() that verifies the new LAG member still exists in the kernel.
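A minimal sketch of what such a kernel existence check could look like, assuming the POSIX `if_nametoindex()` call is an acceptable way to probe for the netdev. This is not the code from the patch; `netdevExists` and `shouldAddLagMember` are hypothetical names used only for illustration.

```cpp
#include <net/if.h>   // if_nametoindex()
#include <string>

// Hypothetical helper: returns true if the kernel currently has a network
// device with this name. if_nametoindex() returns 0 when no such interface
// exists.
bool netdevExists(const std::string &alias)
{
    return if_nametoindex(alias.c_str()) != 0;
}

// Hypothetical guard in the spirit of the description above: only proceed
// with adding a LAG member when its host interface is still present.
bool shouldAddLagMember(const std::string &member)
{
    if (!netdevExists(member))
    {
        // The host interface was removed (for example on syncd exit),
        // so adding it to the port channel could only fail.
        return false;
    }
    return true;
}
```

With a guard like this, a stale port state notification for a removed interface would be skipped instead of reaching teamd and producing "No such device" errors.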

Why I did it

In the syncd container autorestart scenario, when syncd exits, the host interfaces (tun/tap netdevs) go to the DOWN state and are then removed.

Because of the validation linked below, teammgr receives the notification about the port state change (the information is updated in the state DB and a pubsub message is sent), but the port state record is not removed from the state DB when the port is deleted: https://github.com/sonic-net/sonic-swss/blob/7cc035f93d028ea95488ce54e833d1699e3fd08a/portsyncd/linksync.cpp#L210

As a result, on the port state change notification, isPortStateOk() succeeds and TeamMgr::addLagMember() is executed even though the host interface has already been removed.
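The sequence above can be illustrated with a small self-contained model. It is a simplification under stated assumptions, not the swss implementation: the state DB is modeled as a map, the kernel as a set of device names, and every name here (`Model`, `portStateOkInDb`, `removeNetdev`, `wouldAttemptAdd`) is hypothetical.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model of the two sources of truth involved in the race.
struct Model
{
    std::map<std::string, std::string> stateDb; // port -> state record
    std::set<std::string> kernelNetdevs;        // devices present in the kernel
};

// Models a state check that looks only at the state DB record.
bool portStateOkInDb(const Model &m, const std::string &port)
{
    auto it = m.stateDb.find(port);
    return it != m.stateDb.end() && it->second == "ok";
}

// Models the described behavior on syncd exit: the netdev disappears from
// the kernel, but its state DB record is left in place.
void removeNetdev(Model &m, const std::string &port)
{
    m.kernelNetdevs.erase(port);
}

// Decides whether an add would be attempted, with or without the extra
// kernel existence check.
bool wouldAttemptAdd(const Model &m, const std::string &port, bool checkKernel)
{
    if (!portStateOkInDb(m, port))
    {
        return false;
    }
    if (checkKernel && m.kernelNetdevs.count(port) == 0)
    {
        return false; // stale record: the device is gone
    }
    return true;
}
```

In this model, after `removeNetdev()` the state check alone still passes, so the add is attempted on a missing device. With the kernel check enabled, the stale record is ignored.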

The operation is expected to be ignored if the port is already enslaved: https://github.com/sonic-net/sonic-swss/blob/7cc035f93d028ea95488ce54e833d1699e3fd08a/cfgmgr/teammgr.cpp#L721

The check fails since the port has already been removed: https://github.com/sonic-net/sonic-swss/blob/7cc035f93d028ea95488ce54e833d1699e3fd08a/cfgmgr/teammgr.cpp#L412

As a result, the TeamMgr::addLagMember() logic is executed and fails:

Jun 21 11:47:12.265955 cab18-7-dut INFO teamd#/supervisord: teammgrd Cannot find device "Ethernet0"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- checkPortIffUp: Failed to get port Ethernet0 flags
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- addLagMember: Failed to add Ethernet0 to port channel PortChannel102

The issue became reproducible after https://github.com/sonic-net/sonic-swss/pull/2233

How I verified it

autorestart/test_container_autorestart.py -k 'syncd' 

akokhan avatar Sep 20 '22 13:09 akokhan

@liorghub , @prsunny , please take a look

akokhan avatar Sep 22 '22 13:09 akokhan

When the syncd container auto-restarts, it also restarts swss and cleans up STATE_DB. So in which scenario do we encounter this?

prsunny avatar Sep 23 '22 20:09 prsunny

@liorghub , seems to be introduced after #2233. Can you please review this?

prsunny avatar Sep 23 '22 20:09 prsunny

When the syncd container auto-restarts, it also restarts swss and cleans up STATE_DB. So in which scenario do we encounter this?

On syncd exit, the host interfaces (tun/tap netdevs) go to the DOWN state and are then removed. It looks like this triggers the notification to teammgr before STATE_DB gets cleaned up. That is why teammgr tries to add LAG members that no longer exist.

akokhan avatar Sep 27 '22 12:09 akokhan

@judyjoseph , @prsunny , please review. Thanks

akokhan avatar Sep 29 '22 05:09 akokhan

@judyjoseph , @prsunny , please review. Thanks

akokhan avatar Sep 30 '22 17:09 akokhan

@judyjoseph , @prsunny , please review. Thanks

Taking a look @akokhan -- thanks

judyjoseph avatar Sep 30 '22 20:09 judyjoseph

@judyjoseph , did you get a chance to review this? Thanks

akokhan avatar Oct 05 '22 08:10 akokhan

@judyjoseph , @prsunny , please review. Thanks

msosyak avatar Oct 19 '22 15:10 msosyak

@judyjoseph , did you get a chance to check this PR? Thanks

akokhan avatar Oct 25 '22 14:10 akokhan

@prsunny , please approve and merge. Thank you.

akokhan avatar Oct 28 '22 10:10 akokhan