sonic-swss
sonic-swss copied to clipboard
Fix flow counter out-of-order issue by notifying counter operations using SelectableChannel
What I did
Fix flow counter out-of-order issue by notifying counter operations using SelectableChannel Depends on https://github.com/sonic-net/sonic-sairedis/pull/1362
Why I did it
Currently, the operations of SAI objects and their counters (if any) are triggered by different channels, which introduces racing conditions:
- the creation and destruction of the objects are notified using the
SelectableChannel, - the operations of counters, including starting and stopping polling the counters, are notified by listening to the
FLEX_COUNTERandFLEX_COUNTER_GROUPtables in theFLEX_COUNTER_DB - The orchagent always respects the order when starting/stopping counter-polling (which means to start counter-polling after creating the object and to stop counter-polling before destroying the object) but
syncdcan receive events in a wrong order, eg. it receives destroying an object first and then stopping counter polling on the object, it can poll counter for a non-exist object, which causes errors in vendor SAI.
The new solution is to extend SAI redis attributes on the SAI_SWITCH_OBJECT to notify counter polling. As a result, all the objects and their counters are notified using a unified channel, the SelectableChannel.
How I verified it
Manual test Regression test Mock test
Details if related
- Extend SAI redis attribute on the switch object to identify counter-polling operations.
- Introduce a CLI option of orchagent to identify whether the new or the old approach is used for notifying counter-polling. It can not be changed on the fly.
- Move the logic to notify counter polling from each orchagent class and flex counter manager class to a commonplace. The logic is:
- For the new approach, notify counter polling using SAI extended redis attribute
- For the old approach, notify counter-polling using flex database tables. The corresponding flex database table objects (
ProducerTable) are initialized during orchagent initialization, before any flex counter operations.
- Remove the definition of and the references to flex counter and flex counter group table in each orchagent class (usually
m_flexCounterTableandm_flexCounterGroupTable) - Improve the logic to load the counters' Lua plugin for mock test. The Lua scripts are not available in the mock test scenarios, which causes exceptions to load them in mock test. In case it fails to load the Lua plugin,
- Originally, it skipped any flex counter group operations.
- Now, it will leave the SHA code of the Lua plugin empty, and continue to notify the flex counter operations to SAI redis.
- In the production scenario, where it can hardly happen, the plugin field will not be notified to SAI.
- In the mock test scenario, it will notify SAI as it is. This is for mock test to verify whether the plugin field is provided as expected.
- Gearbox counter handling.
- Maintain a mapping from each OID created based on the gearbox, from the object ID itself to the gearbox's OID.
- To notify counter operations for such OIDs in the new approach, we need to first fetch the corresponding gearbox OID from the mapping and the notify the gearbox OID as
switch OID. - The old approach looks buggy.
- There can be more than one gearbox syncd docker in the system, which is a P2MP scenario, the OA needs to communicate to multiple gearbox syncd dockers.
- The old approach leverages
ConsumerTable,ProducerTablemechanism to communicate between OA and sairedis. However, it works for P2P scenarios only. There is a logic forConsumerTableto consume the update once a gearbox syncd sees it, leaving all rest gearbox syncd daemons to see nothing. As a result, for each update there is only one gearbox syncd that sees and handles it.
- There is a logic in intfsorch, copporch and vxlanorch which is to add a counter for an object only after it occurs in the
VIRTORIDtable, which is a WA of the counter out-of-order issue. Now that the issue has been fixed in the new approach, it does not need to check anymore.
Performance analysis The counter operations are handled in the same thread in both the new and old solutions. In swss, the counter operation was asynchronous in the old solution and is synchronous now, which can introduce a bit more latency. However, as the number of counter operations is small, no performance degradation is observed.
Please fix build issues
Please fix build issues
Hi @kcudnik It depends on the sairedis PR (https://github.com/sonic-net/sonic-sairedis/pull/1362) to define extended attributes.
Will review
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
/azpw run
/azpw run
As mentioned in the description, the PR depends on sairedis PR (https://github.com/sonic-net/sonic-sairedis/pull/1362) merged a week ago. However, the sairedis Debian packages fetched in the build were TOO OLD to satisfy the dependency.
##[debug]ProjectId: be1b070f-be15-4154-aade-b1d3bfb17054
##[debug]PipelineVersionToDownload: latestFromBranch
##[debug]Agent environment resources - Disk: / Available 25389.00 MB out of 29593.00 MB, Memory: Used 701.00 MB out of 15968.00 MB, CPU: Usage 2.53%
##[debug]PipelineId from non-trigerringBuild: 495079
Download from the specified build: #495079
##[debug]Processed: ##vso[task.setvariable variable=BuildNumber;issecret=False;]495079
Download artifact to: /data/work/1/a/download/sairedis
Using default max parallelism.
It fetched 495079 in the build. However, it was built on 9th March, more than 2 weeks before my sairedis PR. I think this is because there has been no successful build since 495079. Can anyone help to look into it?
511120 20240330.8 master Azure.sonic-sairedis failed 2024-03-30T09:00:59 2024-03-30T10:55:25 594944dbe4 [Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=511120) [Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/511120/artifacts?branchName=master)
508223 20240326.3 master Azure.sonic-sairedis failed 2024-03-26T14:26:18 2024-03-26T15:20:54 594944dbe4 [Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=508223) [Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/508223/artifacts?branchName=master)
505863 20240323.8 master Azure.sonic-sairedis failed 2024-03-23T09:01:19 2024-03-23T10:32:08 bb948f632b [Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=505863) [Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/505863/artifacts?branchName=master)
500443 20240316.8 master Azure.sonic-sairedis failed 2024-03-16T09:01:01 2024-03-16T10:48:54 bb948f632b [Build Link](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=500443) [Artifacts](https://sonic-build.azurewebsites.net/ui/sonic/pipelines/12/builds/500443/artifacts?branchName=master)
@stephenxs can you please handle the conflict? @kcudnik thanks for the great feedback. assuming conflicts are handled, any further feedback on the PR or this can be approved from your end?
/azpw run
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Rebased to resolve the conflict.
@stephenxs can you please handle the conflict? @kcudnik thanks for the great feedback. assuming conflicts are handled, any further feedback on the PR or this can be approved from your end?
i added some potential refactoring changes that could be desired, but i approved current solution, but i don't have permission to approve that, you need @qiluo-msft approve or @prsunny approve
@qiluo-msft @prsunny , Can you please approve and merge ?