bgpd: Optimize output buffer memory for show command
The vtysh process sends a show command to the bgpd process; bgpd then traverses the entire routing table, writes the results into the output buffer (vty->obuf), and sends the buffered data to vtysh only after the traversal completes. When the routing table is large, this consumes a large amount of memory: with 1 million routes, peak memory usage reaches 78.7 MB.
With this change, the traversal task is asynchronous: when the obuf size exceeds a threshold (measured in chunks), the traversal yields and an event is registered. The traversal continues only after the obuf size falls below the threshold.
Gating the traversal on the obuf size keeps memory consumption within the desired range, and making the task asynchronous ensures that other tasks are not blocked while the data is traversed and transmitted.
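For illustration, a rough sketch of the pattern (not the actual code in this PR; `bgp_show_ctx`, `obuf_over_threshold()` and `schedule_resume_event()` are hypothetical names used only to show the control flow):

```c
/* Sketch only: illustrates the yield/resume idea described above.
 * bgp_show_ctx, obuf_over_threshold() and schedule_resume_event()
 * are hypothetical names, not the actual FRR API.
 */
struct bgp_show_ctx {
	struct vty *vty;       /* vty whose obuf is being filled */
	struct bgp_dest *next; /* where to resume the table walk */
};

static void bgp_show_walk(struct bgp_show_ctx *ctx)
{
	struct bgp_dest *dest;

	for (dest = ctx->next; dest; dest = bgp_route_next(dest)) {
		/* format one route into vty->obuf */
		vty_out(ctx->vty, "...one route...\n");

		if (obuf_over_threshold(ctx->vty)) {
			/* Remember where we stopped and yield; a resume
			 * event continues the walk once obuf has drained
			 * below the threshold.
			 */
			ctx->next = bgp_route_next(dest);
			schedule_resume_event(ctx);
			return;
		}
	}
	/* Traversal finished; the vty layer flushes the remaining data. */
}
```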
P.S. My current code depends on #19337, which has not been merged into the master branch yet. The first commit, "lib: initial support of yield/resume in vtysh", comes from PR 19337; the second commit, "bgpd: Optimize output buffer memory for show command", is my modification.
hmm, you'll need to follow the developer guide for commits and code before this is going to get attention?
Thanks for pointing this out, I've revised both the commit and code.
take a second and look over the frrbot results and fix them. Again nothing will be looked at until coding standards have been met.
Okay, I’ve updated the code.
I think there's a basic issue here, that's worth asking about. This code makes library changes to expose some inner details of the vty library messaging, and then uses those details inside bgp to make decisions. That's a problem in a couple of ways.

The benefit of being able to yield during a long-running show command at scale is at least as important as tightly managing memory footprint, at scale. Allowing the main pthread of bgp to mix other work in while handling a long-running show command is very valuable. The decision-making about yielding should be done by using bgp information, imo - by deciding when "enough" show command work has been done, and it's time to yield. Doing a reach-around to the guts of the vty transport isn't really necessary - and it could prevent yielding entirely if the reader of the transport is pacing itself just right.

I'd rather see a smaller library footprint, no reacharound, and a simple counter inside bgp. If you really think there's value in tightly controlling the vty sending end, that could be added inside the vty library callback. When the resume event is scheduled, the vty library code could take a look at its transport and decide to defer if necessary. But I really think that unless something is going very wrong, just getting some yields into the show command processing path is going to be almost all of the value.
Thanks for the feedback. To clarify: are you suggesting that using vty_obuf_has_space to expose internal VTY details is inappropriate?
First, even if the reader paces itself just right, that cannot make the traversal impossible to interrupt. In vty_out, a flush via vtysh_flush is triggered only after the accumulated written bytes reach 128 KB (32 chunks), while the obuf threshold I set is 20 chunks. So no sending occurs during the traversal; data accumulates in obuf until the threshold is reached, which triggers a suspension, and the sending then happens in vty_yield. This is effectively equivalent to checking how many entries have been traversed. Even if the threshold were raised above 32 chunks, it would still not be a problem: vtysh_flush sends at most 64 KB (16 chunks) per call, so data continues to accumulate in obuf.
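Working out the sizes these numbers imply (assuming a uniform chunk size; the macro names here are just for illustration):

```c
/* Sizes implied by the numbers above (uniform chunk size assumed). */
#define CHUNK_SIZE       (128 * 1024 / 32) /*   4 KB per chunk            */
#define VTY_OUT_FLUSH_AT (32 * CHUNK_SIZE) /* 128 KB: vty_out flushes     */
#define OBUF_YIELD_AT    (20 * CHUNK_SIZE) /*  80 KB: traversal yields    */
#define VTYSH_FLUSH_MAX  (16 * CHUNK_SIZE) /*  64 KB: max per flush call  */
/* 80 KB < 128 KB, so the yield threshold is hit before vty_out flushes. */
```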
Second, the suspension condition for the traversal task could be changed to the number of entries processed; however, when resuming traversal, checking obuf is necessary to control memory usage. I currently perform this check in bgp_show_cb; we could move it to yield_resume_cb to reduce exposure of VTY internals.
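A rough sketch of what that could look like (again with hypothetical names; `obuf_over_threshold()` and `reschedule_resume()` do not exist, and the callback shape is only assumed to follow the usual FRR event pattern):

```c
/* Sketch only: the resume callback re-checks obuf before continuing.
 * obuf_over_threshold() and reschedule_resume() are hypothetical;
 * bgp_show_ctx / bgp_show_walk are the names from the earlier sketch.
 */
static void yield_resume_cb(struct event *event)
{
	struct bgp_show_ctx *ctx = EVENT_ARG(event);

	if (obuf_over_threshold(ctx->vty)) {
		/* obuf has not drained enough yet: defer and retry later
		 * so the socket send buffer gets a chance to empty.
		 */
		reschedule_resume(ctx);
		return;
	}

	/* obuf has space again: continue the table walk. */
	bgp_show_walk(ctx);
}
```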
If we do not check obuf on resume to decide whether to re-suspend or continue, memory consumption ends up the same as the synchronous behavior on the current master branch. For example, if the thread has no other tasks to run, after 1,000 entries the task suspends and registers an event, and the thread immediately schedules that event and continues traversal. In effect, it keeps traversing continuously, with vtysh_flush being called to send data in between. In terms of memory control, this is no different from the current optimization on master (from PR #16772).
The reason that calling vtysh_flush during a continuous traversal provides only limited memory savings is that traversing and writing to obuf is much faster than the inter-process communication (about 10x in my local tests). In the current optimization (PR #16772), vty_out triggers a send after 128 KB of accumulated writes, and vtysh_flush can send at most 64 KB per call, so in theory at most half of each batch stays in obuf and memory usage should drop by about 50%; at larger route scales, however, the actual reduction is only about 10%. Because of the speed gap, the frequently triggered sends eventually fill the socket send buffer. We then have to wait for it to drain, during which time vtysh_flush actually sends 0 bytes. As a result, the obuf list grows at the tail faster than it is released at the head, and a large amount of data still accumulates in obuf during the traversal.
In short, the data path is: routing table -> obuf -> socket send buffer. My approach essentially checks obuf to halt traversal and pause writes so that the write rate into obuf matches the socket’s send rate, preventing excessive accumulation in the intermediate obuf. Therefore, I believe checking obuf is necessary; the exact place to perform the check is open to discussion.
yes, no argument that flow-control is good, and buffering endlessly isn't. but I'd rather not start hard-coding decision-making in each daemon's functions; that's hard for us to maintain. I'd rather start with the basic yield/resume changes, ensure that all the corner-cases are covered and that change is robust, and then figure out where to put the flow-control decision-making. I think it would be best in the common library code, but that's just an opinion at this point.
I agree with you—how about handling flow control in yield_resume_cb?
I’ve submitted a new commit that moves the flow-control decision-making and updates some related code accordingly. Could you take a look and see if this approach is appropriate?
With this change, the control over obuf memory usage is a bit more relaxed: the peak memory usage equals VTY_OBUF_LIMIT plus the output size of show_yield_limit routing-table entries.
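Roughly, the bound works out as follows (`avg_entry_out` is a stand-in name for the average formatted size of one routing-table entry):

```c
/* Back-of-envelope bound for the new scheme: the obuf may already hold
 * up to VTY_OBUF_LIMIT bytes, plus whatever the current batch of
 * show_yield_limit entries adds before the next yield check.
 * avg_entry_out is a hypothetical stand-in, not an existing symbol.
 */
size_t peak_obuf = VTY_OBUF_LIMIT + show_yield_limit * avg_entry_out;
```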
This pull request has conflicts, please resolve those before we can evaluate the pull request.
rebase
Hi @riw777, I noticed that this PR is now marked "do not merge".
Could you let me know the reason and what I should change to move it forward? Thanks!
This pull request has conflicts, please resolve those before we can evaluate the pull request.
rebased to new master, resolved conflicts
I've removed the "don't merge" label and just made this a draft - it includes some lib work that will come in from another PR, but it's easier to see the CI and checks status without that label.