sonic-swss icon indicating copy to clipboard operation
sonic-swss copied to clipboard

Fix flow counter out-of-order issue by notifying counter operations using SelectableChannel

Open stephenxs opened this issue 1 year ago • 15 comments

What I did

Fix flow counter out-of-order issue by notifying counter operations using SelectableChannel Depends on https://github.com/sonic-net/sonic-sairedis/pull/1362

Why I did it

Currently, the operations of SAI objects and their counters (if any) are triggered by different channels, which introduces racing conditions:

  • the creation and destruction of the objects are notified using the SelectableChannel,
  • the operations of counters, including starting and stopping polling the counters, are notified by listening to the FLEX_COUNTER and FLEX_COUNTER_GROUP tables in the FLEX_COUNTER_DB
  • The orchagent always respects the order when starting/stopping counter-polling (which means to start counter-polling after creating the object and to stop counter-polling before destroying the object) but syncd can receive events in a wrong order, eg. it receives destroying an object first and then stopping counter polling on the object, it can poll counter for a non-exist object, which causes errors in vendor SAI.

The new solution is to extend SAI redis attributes on the SAI_SWITCH_OBJECT to notify counter polling. As a result, all the objects and their counters are notified using a unified channel, the SelectableChannel.

How I verified it

Manual test Regression test Mock test

Details if related

  1. Extend SAI redis attribute on the switch object to identify counter-polling operations.
  2. Introduce a CLI option of orchagent to identify whether the new or the old approach is used for notifying counter-polling. It can not be changed on the fly.
  3. Move the logic to notify counter polling from each orchagent class and flex counter manager class to a commonplace. The logic is:
    • For the new approach, notify counter polling using SAI extended redis attribute
    • For the old approach, notify counter-polling using flex database tables. The corresponding flex database table objects (ProducerTable) are initialized during orchagent initialization, before any flex counter operations.
  4. Remove the definition of and the references to flex counter and flex counter group table in each orchagent class (usually m_flexCounterTable and m_flexCounterGroupTable)
  5. Improve the logic to load the counters' Lua plugin for mock test. The Lua scripts are not available in the mock test scenarios, which causes exceptions to load them in mock test. In case it fails to load the Lua plugin,
    • Originally, it skipped any flex counter group operations.
    • Now, it will leave the SHA code of the Lua plugin empty, and continue to notify the flex counter operations to SAI redis.
      • In the production scenario, where it can hardly happen, the plugin field will not be notified to SAI.
      • In the mock test scenario, it will notify SAI as it is. This is for mock test to verify whether the plugin field is provided as expected.
  6. Gearbox counter handling.
    • Maintain a mapping from each OID created based on the gearbox, from the object ID itself to the gearbox's OID.
    • To notify counter operations for such OIDs in the new approach, we need to first fetch the corresponding gearbox OID from the mapping and the notify the gearbox OID as switch OID.
    • The old approach looks buggy.
      • There can be more than one gearbox syncd docker in the system, which is a P2MP scenario, the OA needs to communicate to multiple gearbox syncd dockers.
      • The old approach leverages ConsumerTable, ProducerTable mechanism to communicate between OA and sairedis. However, it works for P2P scenarios only. There is a logic for ConsumerTable to consume the update once a gearbox syncd sees it, leaving all rest gearbox syncd daemons to see nothing. As a result, for each update there is only one gearbox syncd that sees and handles it.
  7. There is a logic in intfsorch, copporch and vxlanorch which is to add a counter for an object only after it occurs in the VIRTORID table, which is a WA of the counter out-of-order issue. Now that the issue has been fixed in the new approach, it does not need to check anymore.

Performance analysis The counter operations are handled in the same thread in both the new and old solutions. In swss, the counter operation was asynchronous in the old solution and is synchronous now, which can introduce a bit more latency. However, as the number of counter operations is small, no performance degradation is observed.

stephenxs avatar Mar 13 '24 14:03 stephenxs

Please fix build issues

kcudnik avatar Mar 22 '24 16:03 kcudnik

Please fix build issues

Hi @kcudnik It depends on the sairedis PR (https://github.com/sonic-net/sonic-sairedis/pull/1362) to define extended attributes.

stephenxs avatar Mar 23 '24 01:03 stephenxs

Will review

kcudnik avatar Mar 23 '24 05:03 kcudnik

/azpw run

stephenxs avatar Mar 28 '24 23:03 stephenxs

/AzurePipelines run

mssonicbld avatar Mar 28 '24 23:03 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Mar 28 '24 23:03 azure-pipelines[bot]

/azpw run

stephenxs avatar Mar 29 '24 09:03 stephenxs

/azpw run

stephenxs avatar Mar 31 '24 00:03 stephenxs

As mentioned in the description, the PR depends on sairedis PR (https://github.com/sonic-net/sonic-sairedis/pull/1362) merged a week ago. However, the sairedis Debian packages fetched in the build were TOO OLD to satisfy the dependency.

##[debug]ProjectId: be1b070f-be15-4154-aade-b1d3bfb17054
##[debug]PipelineVersionToDownload: latestFromBranch
##[debug]Agent environment resources - Disk: / Available 25389.00 MB out of 29593.00 MB, Memory: Used 701.00 MB out of 15968.00 MB, CPU: Usage 2.53%
##[debug]PipelineId from non-trigerringBuild: 495079
Download from the specified build: #495079
##[debug]Processed: ##vso[task.setvariable variable=BuildNumber;issecret=False;]495079
Download artifact to: /data/work/1/a/download/sairedis
Using default max parallelism.

It fetched 495079 in the build. However, it was built on 9th March, more than 2 weeks before my sairedis PR. I think this is because there has been no successful build since 495079. Can anyone help to look into it?

511120	20240330.8	master	Azure.sonic-sairedis	failed	2024-03-30T09:00:59	2024-03-30T10:55:25	594944dbe4	[Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=511120)	[Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/511120/artifacts?branchName=master)
508223	20240326.3	master	Azure.sonic-sairedis	failed	2024-03-26T14:26:18	2024-03-26T15:20:54	594944dbe4	[Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=508223)	[Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/508223/artifacts?branchName=master)
505863	20240323.8	master	Azure.sonic-sairedis	failed	2024-03-23T09:01:19	2024-03-23T10:32:08	bb948f632b	[Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=505863)	[Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/505863/artifacts?branchName=master)
500443	20240316.8	master	Azure.sonic-sairedis	failed	2024-03-16T09:01:01	2024-03-16T10:48:54	bb948f632b	[Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=500443)	[Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/500443/artifacts?branchName=master)

stephenxs avatar Mar 31 '24 02:03 stephenxs

@stephenxs can you please handle the conflict? @kcudnik thanks for the great feedback. assuming conflicts are handled, any further feedback on the PR or this can be approved from your end?

liat-grozovik avatar Apr 16 '24 08:04 liat-grozovik

/azpw run

stephenxs avatar Apr 21 '24 23:04 stephenxs

/AzurePipelines run

mssonicbld avatar Apr 21 '24 23:04 mssonicbld

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Apr 21 '24 23:04 azure-pipelines[bot]

Rebased to resolve the conflict.

stephenxs avatar Apr 30 '24 03:04 stephenxs

@stephenxs can you please handle the conflict? @kcudnik thanks for the great feedback. assuming conflicts are handled, any further feedback on the PR or this can be approved from your end?

i added some potential refactoring changes that could be desired, but i approved current solution, but i don't have permission to approve that, you need @qiluo-msft approve or @prsunny approve

kcudnik avatar May 04 '24 08:05 kcudnik

@qiluo-msft @prsunny , Can you please approve and merge ?

dprital avatar May 06 '24 16:05 dprital