sonic-swss
[teammgr] Added LAG member check into addLagMember()
Signed-off-by: Andriy Kokhan [email protected]
What I did
Added a check to addLagMember() that verifies whether the new LAG member still exists in the kernel.
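For illustration only, one way to test whether a netdev still exists in the kernel is the POSIX call if_nametoindex(). This is a minimal sketch, not the code added by this PR: the helper name netdevExists() is hypothetical, and the actual check in addLagMember() may query the kernel differently.

```cpp
#include <net/if.h>
#include <string>

// Hypothetical helper (not part of sonic-swss): reports whether a network
// interface with the given name is currently known to the kernel.
bool netdevExists(const std::string &ifname)
{
    // if_nametoindex() returns 0 when no interface with that name exists.
    return if_nametoindex(ifname.c_str()) != 0;
}
```

A caller in the spirit of this change would skip the LAG member add when such a helper returns false, instead of proceeding and failing on a missing device.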
Why I did it
In the syncd container autorestart scenario, when syncd exits, the host interfaces (tun/tap netdevs) go to the DOWN state and are then removed.
Because of the validation linked below, teammgr receives the notification about the port state change (the state DB is updated and a pubsub message is sent), but the port state record is not removed from the state DB when the port is deleted: https://github.com/sonic-net/sonic-swss/blob/7cc035f93d028ea95488ce54e833d1699e3fd08a/portsyncd/linksync.cpp#L210
As a result, on a port state change notification, isPortStateOk() succeeds and TeamMgr::addLagMember() is executed even though the host interface has already been removed.
The operation is expected to be ignored if the port is already enslaved: https://github.com/sonic-net/sonic-swss/blob/7cc035f93d028ea95488ce54e833d1699e3fd08a/cfgmgr/teammgr.cpp#L721
However, that check fails because the port has already been removed: https://github.com/sonic-net/sonic-swss/blob/7cc035f93d028ea95488ce54e833d1699e3fd08a/cfgmgr/teammgr.cpp#L412
Consequently, the rest of the TeamMgr::addLagMember() logic runs and fails:
Jun 21 11:47:12.265955 cab18-7-dut INFO teamd#/supervisord: teammgrd Cannot find device "Ethernet0"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- checkPortIffUp: Failed to get port Ethernet0 flags
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- addLagMember: Failed to add Ethernet0 to port channel PortChannel102
The issue became reproducible after https://github.com/sonic-net/sonic-swss/pull/2233
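The sequence described above can be modelled with a small, self-contained sketch. Everything in it is hypothetical (the Model struct, the AddResult enum, and the checkKernel flag are not sonic-swss names), and it assumes that the new check causes the request to be ignored when the netdev is gone. It only mirrors the order of decisions: a stale state DB record passes the port state check, the "already enslaved" test cannot succeed for a removed netdev, and the add then fails unless the kernel is consulted first.

```cpp
#include <set>
#include <string>

// Hypothetical model of the decision flow; not the real teammgrd code.
enum class AddResult { Added, Ignored, Failed };

struct Model {
    std::set<std::string> stateDbPorts;   // ports that have a state record in the state DB
    std::set<std::string> kernelNetdevs;  // host interfaces currently present in the kernel
    std::set<std::string> enslaved;       // ports already attached to a port channel
};

AddResult addLagMember(Model &m, const std::string &port, bool checkKernel)
{
    // Stand-in for isPortStateOk(): it only looks at the state record,
    // which may be stale after the netdev has been removed.
    if (m.stateDbPorts.count(port) == 0)
        return AddResult::Ignored;

    // Stand-in for the "already enslaved" test: it queries the kernel,
    // so it cannot succeed once the netdev is gone.
    if (m.kernelNetdevs.count(port) != 0 && m.enslaved.count(port) != 0)
        return AddResult::Ignored;

    // Assumed behaviour of the new check: ignore the request if the
    // member no longer exists in the kernel.
    if (checkKernel && m.kernelNetdevs.count(port) == 0)
        return AddResult::Ignored;

    // Without the check, the add is attempted and fails on a missing device.
    if (m.kernelNetdevs.count(port) == 0)
        return AddResult::Failed;

    m.enslaved.insert(port);
    return AddResult::Added;
}
```

With a stale record for Ethernet0 and no matching netdev, the model returns Failed when checkKernel is false and Ignored when it is true, which corresponds to the error log above disappearing.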
How I verified it
autorestart/test_container_autorestart.py -k 'syncd'
@liorghub , @prsunny , please take a look
When the syncd container auto-restarts, it also restarts swss and cleans up STATE_DB. So in which scenario do we encounter this?
@liorghub , seems to be introduced after #2233. Can you please review this?
When the syncd container auto-restarts, it also restarts swss and cleans up STATE_DB. So in which scenario do we encounter this?
On syncd exit, the host interfaces (tun/tap netdevs) go to the DOWN state and are then removed. It looks like this triggers the notification to teammgr before STATE_DB gets cleaned up. That's why teammgr tries to add LAG members which no longer exist.
@judyjoseph , @prsunny , please review. Thanks
Taking a look @akokhan -- thanks
@judyjoseph , did you get a chance to review this? Thanks
@judyjoseph , did you get a chance to check this PR? Thanks
@prsunny , please approve and merge. Thank you.