dplane fails to limit the maximum length of the queue
Describe the bug
FRR version: 8.2.2, kernel version: Linux 5.10
When I learn 2 million routes from BGP neighbors in a short period of time, dplane consumes a large amount of memory for cached contexts.
It appears that when a large number of routes are learned quickly, dplane holds so much cached memory that it can lead to an out-of-memory (OOM) situation.
- [X] Did you check if this is a duplicate issue?
- [ ] Did you test it on the latest FRRouting/frr master branch?
To Reproduce
Expected behavior
Screenshots
Versions
- OS Version:
- Kernel: Linux 5.10
- FRR Version: 8.2.2
Additional context
Can we see the CLI you have for zebra's command line?
I'd like to see a `show thread cpu` as well.
Why didn't you include the entirety of the `show thread cpu` output?
In any event, I was able to recreate something similar in my home setup. I am not sure if this is what you are reporting, but it probably is. Can you give this a try and see if it cleans the problem up: https://github.com/FRRouting/frr/pull/15025
I tracked the code flow of the dplane ctx objects and found the cause: after a ctx is processed by the provider, it is queued on rib_dplane_q to wait for the results to be consumed, and zdplane_info.dg_routes_queued is decremented at that point. Because of this, the attempt in meta_queue_process to limit the number of ctx processed per run (200) no longer takes effect. Since rib_process_dplane_results runs in zebra's main thread, it is scheduled relatively slowly, so when a large number of routes are injected in a short time, many contexts accumulate on rib_dplane_q and are not freed in time, which leads to this problem.
My idea for a fix is to check the length of rib_dplane_q in meta_queue_process: if many cached contexts are already queued there, return WQ_QUEUE_BLOCKED to temporarily hold off rib_process. A rough sketch of the idea follows.
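To make the idea concrete, here is a minimal sketch (not a tested patch) of what that extra check could look like at the top of zebra's meta_queue_process(). The rib_dplane_q_count() helper and the RIB_DPLANE_BACKLOG_MAX threshold are hypothetical names invented for illustration; the existing dplane_get_in_queue_len()/dplane_get_in_queue_limit() check and the rest of the function are elided.

```c
/*
 * Sketch only, against zebra/zebra_rib.c.  rib_dplane_q_count() and
 * RIB_DPLANE_BACKLOG_MAX do not exist in FRR; a real patch would need
 * an accessor protected by the same lock that guards rib_dplane_q,
 * and a tuned (or configurable) threshold.
 */
#define RIB_DPLANE_BACKLOG_MAX 4096	/* hypothetical cap */

static wq_item_status meta_queue_process(struct work_queue *dummy, void *data)
{
	/* Existing check: don't overrun the dplane's inbound queue */
	if (dplane_get_in_queue_len() > dplane_get_in_queue_limit())
		return WQ_QUEUE_BLOCKED;

	/*
	 * Proposed additional check: contexts already processed by the
	 * providers sit on rib_dplane_q until zebra's main thread runs
	 * rib_process_dplane_results(), and they no longer count against
	 * dg_routes_queued.  If that backlog grows too large, back off
	 * here as well so the queued contexts can be freed before more
	 * are generated.
	 */
	if (rib_dplane_q_count() > RIB_DPLANE_BACKLOG_MAX)
		return WQ_QUEUE_BLOCKED;

	/* ... existing per-subqueue processing continues here ... */
}
```

Checking in meta_queue_process mirrors the existing dg_routes_queued limit: rib_process simply stops producing new dplane contexts until the main thread drains the rib_dplane_q backlog.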