dplane fails to limit the maximum length of the queue
Describe the bug
FRR version: 8.2.2, kernel version: Linux 5.10
When I learn 2 million routes from BGP neighbors in a short period of time, dplane consumes a large amount of memory for cached contexts.
It appears that when a large number of routes are learned quickly, dplane holds so much cached memory that it can lead to an out-of-memory (OOM) situation.
- [X] Did you check if this is a duplicate issue?
- [ ] Did you test it on the latest FRRouting/frr master branch?
To Reproduce
Expected behavior
Screenshots
Versions
- OS Version:
- Kernel: Linux 5.10
- FRR Version: 8.2.2
Additional context
Can we see the CLI you have for zebra's command line?
I'd like to see a `show thread cpu` as well.
Why didn't you include the entirety of the `show thread cpu` output?
In any event, I was able to recreate something similar in my home setup. I am not sure if this is what you are reporting, but it probably is. Can you give this a try and see if it cleans the problem up: https://github.com/FRRouting/frr/pull/15025
I tracked the code flow of the dplane ctx objects and found the cause: after a ctx is processed by the provider, it is queued on rib_dplane_q to wait for the results to be consumed, and zdplane_info.dg_routes_queued is decremented at that point. Because of this, the attempt in meta_queue_process to limit the number of ctx processed per run (200) no longer takes effect. Since rib_process_dplane_results runs in zebra's main thread, it is scheduled relatively slowly, so when a large number of routes are injected in a short time, many contexts accumulate on rib_dplane_q and are not freed in time, which leads to this problem.
My idea for a fix is to check the length of rib_dplane_q in meta_queue_process: if many cached contexts are already queued there, return WQ_QUEUE_BLOCKED to temporarily hold off rib_process. A rough sketch of the idea follows.
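To make the idea concrete, here is a minimal sketch (not a tested patch) of what that extra check could look like at the top of zebra's meta_queue_process(). The rib_dplane_q_count() helper and the RIB_DPLANE_BACKLOG_MAX threshold are hypothetical names invented for illustration; the existing dplane_get_in_queue_len()/dplane_get_in_queue_limit() check and the rest of the function are elided.

```c
/*
 * Sketch only, against zebra/zebra_rib.c.  rib_dplane_q_count() and
 * RIB_DPLANE_BACKLOG_MAX do not exist in FRR; a real patch would need
 * an accessor protected by the same lock that guards rib_dplane_q,
 * and a tuned (or configurable) threshold.
 */
#define RIB_DPLANE_BACKLOG_MAX 4096	/* hypothetical cap */

static wq_item_status meta_queue_process(struct work_queue *dummy, void *data)
{
	/* Existing check: don't overrun the dplane's inbound queue */
	if (dplane_get_in_queue_len() > dplane_get_in_queue_limit())
		return WQ_QUEUE_BLOCKED;

	/*
	 * Proposed additional check: contexts already processed by the
	 * providers sit on rib_dplane_q until zebra's main thread runs
	 * rib_process_dplane_results(), and they no longer count against
	 * dg_routes_queued.  If that backlog grows too large, back off
	 * here as well so the queued contexts can be freed before more
	 * are generated.
	 */
	if (rib_dplane_q_count() > RIB_DPLANE_BACKLOG_MAX)
		return WQ_QUEUE_BLOCKED;

	/* ... existing per-subqueue processing continues here ... */
}
```

Checking in meta_queue_process mirrors the existing dg_routes_queued limit: rib_process simply stops producing new dplane contexts until the main thread drains the rib_dplane_q backlog.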