dplane failed to limit the maximum length of the queue

Open zice312963205 opened this issue 1 year ago • 6 comments


Describe the bug

FRR version: 8.2.2
Kernel version: Linux 5.10

When I learn 2 million routes from BGP neighbors in a short period of time, dplane consumes a large amount of cache. (Screenshots: 4 images.)

It seems there is an issue where if a large number of routes are learned in a short period of time, dplane will occupy a substantial amount of cache memory, which might lead to an out-of-memory (OOM) situation.

  • [X] Did you check if this is a duplicate issue?
  • [ ] Did you test it on the latest FRRouting/frr master branch?

To Reproduce

Expected behavior

Screenshots

Versions

  • OS Version:
  • Kernel:
  • FRR Version:

Additional context

zice312963205 avatar Dec 14 '23 01:12 zice312963205

Can we see the command line you are running zebra with?

donaldsharp avatar Dec 14 '23 11:12 donaldsharp

I'd like to see a show thread cpu as well

donaldsharp avatar Dec 14 '23 13:12 donaldsharp

image

zice312963205 avatar Dec 14 '23 13:12 zice312963205

Why didn't you include the entirety of the show thread cpu output?

In any event, I was able to recreate something similar in my home setup. I am not sure if this is what you are reporting, but it probably is. Can you give this a try: https://github.com/FRRouting/frr/pull/15025 and see if it cleans the problem up?

donaldsharp avatar Dec 14 '23 14:12 donaldsharp

I traced the code path of ctx and found the cause: after the provider has processed a ctx, the ctx is queued on rib_dplane_q and held there. At that point the value of zdplane_info.dg_routes_queued is reduced, so the attempt in meta_queue_process to limit the number of ctx processed each time (200) no longer has any effect. Since rib_process_dplane_results is executed in the main thread of zebra, it is scheduled relatively slowly. Therefore, when a large number of routes are injected in a short time, many ctx pile up on rib_dplane_q and are not released in time, which leads to this problem.

One possible fix is to check the length of rib_dplane_q in meta_queue_process. If many nodes are already queued there, return WQ_QUEUE_BLOCKED to temporarily delay the processing of rib_process.
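To make the idea concrete, here is a rough standalone sketch of that check. It is a toy model and not FRR code: struct sim, meta_queue_process_sketch, run_sim, RESULT_Q_LIMIT, BURST and DRAIN are all hypothetical names, the enum is only a stand-in that borrows the WQ_QUEUE_BLOCKED name, and the rates are arbitrary. It only assumes that the provider finishes contexts faster than the main thread drains the results queue.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the work-queue return codes; not the real FRR enum. */
enum wq_status { WQ_SUCCESS, WQ_QUEUE_BLOCKED };

#define RESULT_Q_LIMIT 200 /* hypothetical high-water mark for the results queue */
#define BURST 100          /* routes the meta queue tries to enqueue per pass (arbitrary) */
#define DRAIN 10           /* results the main thread frees per pass (arbitrary) */

struct sim {
	size_t pending_routes; /* routes still waiting in the meta queue */
	size_t provider_q;     /* ctx handed to the provider, not yet processed */
	size_t result_q;       /* ctx waiting for the main thread (rib_dplane_q in the thread) */
	size_t peak_result_q;  /* largest result_q length seen so far */
};

/* One meta-queue step: optionally refuse to enqueue while results are backed up. */
static enum wq_status meta_queue_process_sketch(struct sim *s, int check_results)
{
	if (check_results && s->result_q >= RESULT_Q_LIMIT)
		return WQ_QUEUE_BLOCKED; /* back off until results are drained */

	if (s->pending_routes > 0) {
		s->pending_routes--;
		s->provider_q++;
	}
	return WQ_SUCCESS;
}

/* Run the toy model to completion and return the peak results-queue length. */
static size_t run_sim(size_t total_routes, int check_results)
{
	struct sim s = { .pending_routes = total_routes };

	assert(DRAIN > 0); /* otherwise the loop below would never finish */

	while (s.pending_routes > 0 || s.result_q > 0) {
		/* Main thread: one meta-queue pass. */
		for (size_t i = 0; i < BURST; i++) {
			if (meta_queue_process_sketch(&s, check_results) == WQ_QUEUE_BLOCKED)
				break;
		}

		/* Provider: fast, so everything it holds moves to the results queue. */
		s.result_q += s.provider_q;
		s.provider_q = 0;
		if (s.result_q > s.peak_result_q)
			s.peak_result_q = s.result_q;

		/* Main thread: slow drain of the results queue. */
		s.result_q -= (s.result_q < DRAIN) ? s.result_q : DRAIN;
	}
	return s.peak_result_q;
}
```

Calling run_sim(10000, 0) models the current behaviour, where the results queue grows with the size of the route burst. Calling run_sim(10000, 1) models the proposed check, where the peak stays below RESULT_Q_LIMIT + BURST no matter how many routes arrive. The trade-off is that route installation is paced by how fast the main thread can drain results.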

zice312963205 avatar Dec 19 '23 09:12 zice312963205

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

github-actions[bot] avatar Jun 17 '24 01:06 github-actions[bot]