[GCU/swss] ERR#swss removeLag error in SYSLOG

Open wen587 opened this issue 2 years ago • 3 comments

Description

There seems to be some execution delay in swss when it processes a GCU JSON change. The delay causes SYSLOG ERR messages about removeLag. Possibly related code (executed before the portchannel removal):

  • ERR swss#intfmgrd: :- setIntfVrf: Command '/sbin/ip link set "PortChannel0005" nomaster' failed with rc 1

    https://github.com/Azure/sonic-swss/blob/master/cfgmgr/intfmgr.cpp#L136-L154

  • ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0005

    https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L5105-L5115

See below for more details.

Steps to reproduce the issue

  1. Add one portchannel and its interface to ConfigDB through GCU.
admin@vlab-01:~/po/test$ cat tc1.json
[
        {"path": "/PORTCHANNEL/PortChannel0005", "value": {"admin_status": "up"}, "op": "add"},
        {"path": "/PORTCHANNEL_INTERFACE/PortChannel0005", "value": {}, "op": "add"},
        {"path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131", "value": {}, "op": "add"},
        {"path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126", "value": {}, "op": "add"}]

admin@vlab-01:~/po/test$ sudo config apply-patch tc1.json
...
Patch Applier: Applying 4 changes in order:
Patch Applier:   * [{"op": "add", "path": "/PORTCHANNEL/PortChannel0005", "value": {"admin_status": "up"}}]
Patch Applier:   * [{"op": "add", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005", "value": {}}]
Patch Applier:   * [{"op": "add", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131", "value": {}}]
Patch Applier:   * [{"op": "add", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126", "value": {}}]
Patch Applier: Verifying patch updates are reflected on ConfigDB.
Patch Applier: Patch application completed.
Patch applied successfully.
  2. Remove or roll back the previous change, then check SYSLOG for ERR messages.
admin@vlab-01:~/po/test$ cat tc1_rm.json
[
 {
  "op": "remove",
  "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126"
 },
 {
  "op": "remove",
  "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131"
 },
 {
  "op": "remove",
  "path": "/PORTCHANNEL_INTERFACE/PortChannel0005"
 },
 {
  "op": "remove",
  "path": "/PORTCHANNEL/PortChannel0005"
 }
]
admin@vlab-01:~/po/test$ sudo config apply-patch tc1_rm.json
...
Patch Applier: Applying 4 changes in order:
Patch Applier:   * [{"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005"}]
Patch Applier:   * [{"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131"}]
Patch Applier:   * [{"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126"}]
Patch Applier:   * [{"op": "remove", "path": "/PORTCHANNEL/PortChannel0005"}]
Patch Applier: Verifying patch updates are reflected on ConfigDB.
Patch Applier: Patch application completed.
Patch applied successfully.

SYSLOG ERR:

May 10 03:17:06.199268 vlab-01 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0005
May 10 03:17:06.199325 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel0005'. Skipping
May 10 03:17:07.241855 vlab-01 ERR swss#intfmgrd: :- setIntfVrf: Command '/sbin/ip link set "PortChannel0005" nomaster' failed with rc 1
May 10 03:17:07.241855 vlab-01 ERR swss#orchagent: message repeated 4 times: [ :- removeLag: Failed to remove ref count 1 LAG PortChannel0005]

From my understanding, ERR teamd#tlm_teamd: :- get_dump is acceptable. It occurs even when the portchannel is deleted through the config CLI (config portchannel del <>).

  3. If we remove PORTCHANNEL_INTERFACE first and then remove PORTCHANNEL, there is no error, so I suspect there is an execution delay in swss. Splitting the removal into two apply-patch runs produces no error:
admin@vlab-01:~/po/test$ cat tc1_part1.json
[
 {
  "op": "remove",
  "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126"
 },
 {
  "op": "remove",
  "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131"
 },
 {
  "op": "remove",
  "path": "/PORTCHANNEL_INTERFACE/PortChannel0005"
 }]

admin@vlab-01:~/po/test$ cat tc1_part2.json
[
 {
  "op": "remove",
  "path": "/PORTCHANNEL/PortChannel0005"
 }
]
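
The two-step removal above could be automated roughly as follows. This is only a sketch: the helper names (`split_removal_patch`, `apply_patch`, `remove_in_two_phases`) and the `settle_seconds` wait are hypothetical and not part of GCU, and the only command taken from this issue is `sudo config apply-patch`. Whether any wait between the two runs is actually needed has not been measured here.

```python
import json
import subprocess
import tempfile
import time


def split_removal_patch(patch):
    """Partition ops so PORTCHANNEL_INTERFACE entries come out first, everything else second."""
    prefix = "/PORTCHANNEL_INTERFACE/"
    interface_ops = [op for op in patch if op["path"].startswith(prefix)]
    other_ops = [op for op in patch if not op["path"].startswith(prefix)]
    return interface_ops, other_ops


def apply_patch(ops):
    """Write ops to a temporary JSON file and run `sudo config apply-patch` on it."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(ops, f)
        path = f.name
    subprocess.run(["sudo", "config", "apply-patch", path], check=True)


def remove_in_two_phases(patch, settle_seconds=1.0):
    """Apply interface removals, pause, then apply the remaining removals."""
    interface_ops, other_ops = split_removal_patch(patch)
    if interface_ops:
        apply_patch(interface_ops)
        # Assumed settle time so swss can process the first batch; a guess, not a documented value.
        time.sleep(settle_seconds)
    if other_ops:
        apply_patch(other_ops)
```

Calling `remove_in_two_phases(json.load(open("tc1_rm.json")))` would reproduce the tc1_part1.json / tc1_part2.json split without maintaining two files by hand.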

Describe the results you received

SYSLOG ERR messages appear when a PortChannel and its PortChannel interface are removed together through GCU.

Describe the results you expected

Not sure whether we can avoid these ERR messages or should just treat them as valid.

Additional information you deem important (e.g. issue happens only occasionally)

Output of show version

admin@vlab-01:~/po/test$ show ver

SONiC Software Version: SONiC.master-10763.96436-aa5cdcc51
Distribution: Debian 11.3
Kernel: 5.10.0-8-2-amd64
Build commit: aa5cdcc51
Build date: Fri May  6 06:25:04 UTC 2022
Built by: AzDevOps@sonic-build-workers-001HFQ

Platform: x86_64-kvm_x86_64-r0
HwSKU: Force10-S6000
ASIC: vs
ASIC Count: 1
Serial Number: N/A
Model Number: N/A
Hardware Revision: N/A
Uptime: 03:22:25 up 23:34,  2 users,  load average: 0.07, 0.16, 0.17
Date: Tue 10 May 2022 03:22:25

wen587 avatar May 10 '22 03:05 wen587

A few questions:

  • Regarding config portchannel del <>, are you saying the same errors occur there? or only ERR teamd#tlm_teamd: :- get_dump?
  • Also, what does the orchagent error Failed to remove ref count 1 LAG PortChannel0005 mean? Is it checking redis or something else? If it is a redis issue, maybe we need to double-check how we interact with redis; if it is async, I think we should make it sync or introduce some wait.
  • What is LAG? Why does the error refer to it?

ghooo avatar May 10 '22 19:05 ghooo

A few questions:

  • Regarding config portchannel del <>, are you saying the same errors occur there? or only ERR teamd#tlm_teamd: :- get_dump?

Only ERR teamd#tlm_teamd: :- get_dump

  • Also, what does the orchagent error Failed to remove ref count 1 LAG PortChannel0005 mean? Is it checking redis or something else? If it is a redis issue, maybe we need to double-check how we interact with redis; if it is async, I think we should make it sync or introduce some wait.

After reading https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L5105-L5115, I think it is not related to redis. It looks like an async issue. I am not sure why m_port_ref_count is not 0 during the PortChannel interface removal.

  • What is LAG? Why does the error refer to it?

LAG is link aggregation group, which is PortChannel in our code base. LAG removal refers to PortChannel removal.
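
To illustrate the ref-count behaviour discussed above, here is a toy model in Python. It is not the orchagent code: the class `ToyPortsOrch` and its methods are hypothetical, and it assumes (based on the error text and the m_port_ref_count mention) that LAG removal is refused while something, such as a PortChannel interface, still holds a reference.

```python
class ToyPortsOrch:
    """Toy model of a ref-count guard on LAG removal (hypothetical, simplified)."""

    def __init__(self):
        self.port_ref_count = {}  # alias -> number of objects still referencing the LAG
        self.lags = set()
        self.errors = []          # stands in for SYSLOG ERR lines

    def add_lag(self, alias):
        self.lags.add(alias)
        self.port_ref_count.setdefault(alias, 0)

    def increase_ref(self, alias):
        # e.g. an interface is created on top of the LAG
        self.port_ref_count[alias] += 1

    def decrease_ref(self, alias):
        # e.g. that interface has finished being removed
        self.port_ref_count[alias] -= 1

    def remove_lag(self, alias):
        count = self.port_ref_count.get(alias, 0)
        if count > 0:
            # Removal is refused while references remain; the caller is expected to retry.
            self.errors.append(f"removeLag: Failed to remove ref count {count} LAG {alias}")
            return False
        self.lags.discard(alias)
        self.port_ref_count.pop(alias, None)
        return True


orch = ToyPortsOrch()
orch.add_lag("PortChannel0005")
orch.increase_ref("PortChannel0005")             # interface still present
first_try = orch.remove_lag("PortChannel0005")   # refused, error recorded
orch.decrease_ref("PortChannel0005")             # interface removal completes later
second_try = orch.remove_lag("PortChannel0005")  # succeeds on retry
```

Under this model, a removal request that arrives before the interface teardown has finished logs the error and succeeds on a later retry, which would be consistent with the "message repeated 4 times" line and with the final state being correct.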

wen587 avatar May 11 '22 06:05 wen587

It does not impact the final result. The current workaround is to keep the error in the Log Analyzer ignore list.
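
A minimal sketch of what such an ignore list amounts to, assuming regex-based matching on syslog lines. The patterns and the `unexpected_errors` helper below are hypothetical, written from the messages quoted in this issue; they are not copied from the Log Analyzer's actual configuration.

```python
import re

# Hypothetical ignore patterns derived from the ERR lines reported in this issue.
IGNORE_PATTERNS = [
    re.compile(r"removeLag: Failed to remove ref count \d+ LAG PortChannel\d+"),
    re.compile(r"""setIntfVrf: Command '/sbin/ip link set "PortChannel\d+" nomaster' failed with rc 1"""),
    re.compile(r"get_dump: Can't get dump for LAG 'PortChannel\d+'"),
]


def unexpected_errors(lines):
    """Return ERR lines that do not match any ignore pattern."""
    return [
        line
        for line in lines
        if " ERR " in line and not any(p.search(line) for p in IGNORE_PATTERNS)
    ]
```

Keeping each pattern specific to the message text, rather than ignoring all swss ERR lines, means unrelated orchagent or intfmgrd errors would still be reported.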

wen587 avatar May 18 '22 04:05 wen587