sonic-swss
[orchagent] RouteOrch cannot consume new routes when enough routes are being retried in m_toSync
Description
When many routes are constantly being retried in the consumer.m_toSync of the ROUTE TABLE (for example, because they are blocked by a nonexistent neighbor), the Consumer never gets to call pops() for any new routes through the Consumer::execute() function. The number of retried routes needed to trigger this issue depends on the shortest-interval Timer whose priority is higher than that of the ROUTE TABLE Consumer. The priority of the ROUTE TABLE Consumer is 5.
Steps to reproduce the issue
- Distribute routes referencing NHG 5822, which does not exist or was deleted earlier
- Deliver NHG 16518
- Update all the routes to reference NHG 16518
Describe the results you received
The old routes are retried continuously and the new routes cannot be consumed. RouteOrch gets stuck here.
Describe the results you expected
New routes are able to be consumed and processed by route orch properly.
Output of show version
Output of show techsupport
(paste your output here or download and attach the file here)
Root cause of this issue
In OrchDaemon::start(), a Selectable is selected and its execute() function is called. After that, doTask() of every orch is triggered, which retries all the remaining tasks. Therefore, if enough routes are being retried, and there is a Timer whose priority is higher than that of the ROUTE TABLE Consumer, and the interval of this Timer is shorter than the time one retry pass takes, the Timer is ready again by the next select and the ROUTE TABLE Consumer is never selected. In other words, new routes are never consumed.
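The starvation described above can be illustrated with a small virtual-time model. This is only a sketch under stated assumptions, not code from sonic-swss: `SimResult` and `simulateSelectLoop` are hypothetical names, the Consumer is assumed to always have new routes waiting, the Timer is assumed to be re-armed when it is served, and the select is assumed to pick the highest-priority ready Selectable.

```cpp
#include <cstdint>

// Hypothetical, simplified model of the loop described above.
// None of these names are taken from sonic-swss.
struct SimResult {
    int timerSelections = 0;     // times the higher-priority Timer was picked
    int consumerSelections = 0;  // times the ROUTE TABLE Consumer was picked
};

// Simulate `iterations` passes of the select loop in virtual time.
//   timerIntervalMs : interval of the Timer whose priority beats the Consumer
//   retryCostMs     : time spent retrying all pending routes after each pick
SimResult simulateSelectLoop(int64_t timerIntervalMs,
                             int64_t retryCostMs,
                             int iterations)
{
    SimResult result;
    int64_t now = 0;
    int64_t nextTimerFire = 0;  // assumption: the Timer is ready at start

    for (int i = 0; i < iterations; ++i) {
        if (now >= nextTimerFire) {
            // The Timer is ready and has higher priority, so it wins.
            ++result.timerSelections;
            nextTimerFire = now + timerIntervalMs;  // re-arm the Timer
        } else {
            // Only when the Timer is not ready does the Consumer get picked.
            ++result.consumerSelections;
        }
        // After execute(), every orch retries its remaining tasks.
        now += retryCostMs;
    }
    return result;
}
```

In this model, `simulateSelectLoop(50, 60, 100)` never selects the Consumer, because each 60 ms retry pass outlasts the 50 ms Timer interval, while `simulateSelectLoop(50, 10, 100)` selects the Consumer in 80 of the 100 iterations.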
Additional information you deem important (e.g. issue happens only occasionally):
This was triggered occasionally in our testbed while BGP was flapping and some interfaces were being shut down and brought up. An additional Timer of ours, whose interval is 50 ms, may contribute to this issue.
Possible solution
Modify the retry mechanism. For example, perform the retry operation only every two loops. The change could also be limited to the route orch to narrow its scope of influence.
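A minimal sketch of the "retry every two loops" idea follows. It is not an actual patch: `RetryThrottle` and `consumerSelectionsWithThrottle` are hypothetical names, and the 1 ms cost of an iteration that skips the retry pass is an assumed value chosen only to keep the virtual clock moving.

```cpp
#include <cstdint>

// Hypothetical helper, not part of sonic-swss: decides whether the retry
// pass should run on the current main-loop iteration.
class RetryThrottle {
public:
    explicit RetryThrottle(unsigned period)
        : m_period(period == 0 ? 1 : period) {}

    // Call once per loop iteration; returns true on every m_period-th call.
    bool shouldRetry()
    {
        if (++m_count >= m_period) {
            m_count = 0;
            return true;
        }
        return false;
    }

private:
    unsigned m_period;
    unsigned m_count = 0;
};

// Virtual-time model of the select loop with throttled retries.
// Returns how many times the ROUTE TABLE Consumer gets selected.
int consumerSelectionsWithThrottle(int64_t timerIntervalMs,
                                   int64_t retryCostMs,
                                   unsigned retryPeriod,
                                   int iterations)
{
    RetryThrottle throttle(retryPeriod);
    int64_t now = 0;
    int64_t nextTimerFire = 0;  // assumption: the Timer is ready at start
    int consumerSelections = 0;

    for (int i = 0; i < iterations; ++i) {
        if (now >= nextTimerFire) {
            nextTimerFire = now + timerIntervalMs;  // Timer wins, re-arm it
        } else {
            ++consumerSelections;  // Timer not ready: Consumer is picked
        }
        // Retry pending tasks only on every retryPeriod-th iteration.
        // Assumption: an iteration without a retry pass costs 1 ms.
        now += throttle.shouldRetry() ? retryCostMs : 1;
    }
    return consumerSelections;
}
```

With a 50 ms Timer and a 60 ms retry pass, a period of 1 (retry on every loop, the behaviour described in this issue) never selects the Consumer in this model, while a period of 2 selects it on every other iteration. Applying such a throttle only inside the route orch would keep the behaviour of other orchs unchanged.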