
mikrotik tree

Open dtaht opened this issue 2 years ago • 65 comments

As near as I can tell, most of libreqos, targeting linux htb + cake, is also exportable in a mikrotik format, just using their configuration keywords to generate a conf file. No xdp, obviously, but...
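
To make the idea concrete, here is a minimal sketch of such an exporter: a shell loop that turns a flat CSV of shaper nodes into RouterOS queue-tree commands. The CSV columns, queue-type name, and reliance on packet marks are illustrative assumptions, not LibreQoS's actual export format.

```bash
#!/bin/bash
# Emit a RouterOS v7 queue tree from a CSV of shaper nodes.
# Assumed columns: name,parent,packet-mark,rate_mbps (hypothetical format).
echo "/queue type add name=cake-shaper kind=cake"
while IFS=, read -r name parent mark rate_mbps; do
  echo "/queue tree add name=$name parent=$parent" \
       "packet-mark=$mark queue=cake-shaper max-limit=${rate_mbps}M"
done < shaper_nodes.csv
```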

dtaht avatar Sep 25 '22 17:09 dtaht

Is Mikrotik affected by single-lock issue?

interduo avatar Sep 25 '22 20:09 interduo

I'm not seeing many problems other than even the best CPUs being fairly slow for shaping. They're going for a scale-out approach with their Annapurna Labs network CPUs. I have CCR2116s in production and they are beasts of routers, but an i5-2400 quad core from the stone age can handle more bits for traffic shaping.

For small operators it would be pretty neat to run libreqos as an OOB service, pushing configs to mikrotik via the API, especially if you were virtualizing your libreqos off-prem. Great for small operators in bandwidth-deprived areas that may not have the resources to put in a PC or nice NICs. A hAP ac2 can handle a couple hundred Mbps in cake; a hAP ax2 nearly double that.

syadnom avatar Oct 09 '22 15:10 syadnom

Could be neat. Is there any way to pull real qdisc stats out of mikrotik? That's how we collect bandwidth stats at the moment.

rchac avatar Oct 09 '22 16:10 rchac

here's a little queue tree on routeros v7.5 with fqcodel. Not a ton of info: bytes, dropped packets, and queued packets are available.

[screenshots: queue tree stats showing bytes, dropped packets, queued packets]
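
(For reference, the same counters are also readable from the CLI; on a v7 box the following prints per-node byte, packet, and drop stats:)

```
/queue tree print stats
```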

syadnom avatar Oct 09 '22 16:10 syadnom

I am far from convinced they are actually collecting drop stats from fqcodel or cake at all. Can you saturate a link (a udp flood will do, so would a ping -f) and see if you get anything?
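
(A quick way to do that, assuming iperf3 on both ends; the flood ping needs root, and `-b 0` tells iperf3 to send UDP unpaced:)

```bash
# flood ping: sends as fast as replies come back
ping -f -s 1400 192.0.2.1

# unpaced UDP flood for 30 seconds
iperf3 -c 192.0.2.1 -u -b 0 -t 30
```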

dtaht avatar Oct 09 '22 17:10 dtaht

fqcodel and cake do not populate the dropped counter; fifo queues do.

The image shows both routers (mikrotik rb5009) across a 10G SFP+ port with a 5x5 cake shaper and both sides running a UDP bandwidth test targeting each other across the shaper. It shows queued packets, but I can't convince it to drop anything.

[screenshot: cake shaper under UDP load, queued packets but zero drops]

syadnom avatar Oct 09 '22 17:10 syadnom

same test with fifo. So this confirms, at least on 7.5, that drops are not being tracked in an accessible way for fqcodel or cake. It DOES work for every other queue type: fifo, sfq, red.

[screenshot: fifo under the same load, drop counters incrementing]

syadnom avatar Oct 09 '22 17:10 syadnom

I don't have a relationship with mikrotik; can you bug-report this? Also, request counters for ecn marks? Seeing ever more of those...

While I'm making feature requests in the wrong forum... it would be so great if we could do inbound shaping with pure cake on what I think is called the "interface queue". It's only 4 lines of code to do this in sqm....

dtaht avatar Oct 09 '22 17:10 dtaht

bug report submitted.

interface queues work in mikrotik. so do bridge shapers if you turn on 'ip firewall' on a bridge.
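
(For reference, a hedged sketch of those two knobs on RouterOS v7; parameter names are from memory, so verify against the docs before deploying:)

```
/interface bridge settings set use-ip-firewall=yes
/queue type add name=cake-up kind=cake cake-bandwidth=100M
/queue interface set ether1 queue=cake-up
```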

I altered that last test to show this. I also set it to 5x11 just so you can see the difference in throughput:

[screenshot: interface queue / bridge shaper test at 5x11]

syadnom avatar Oct 09 '22 18:10 syadnom

Test on TCP with just one stream running, because it's easier to read than my convoluted dual bandwidth test.

[screenshot: single-stream TCP test]

syadnom avatar Oct 09 '22 18:10 syadnom

figured since I'm running tests, might as well show the rb5009's potential here.

UDP one way, cake basically maxed out, with the mikrotik bandwidth test doing the work. There's a little more to be extracted here if running iperf on a separate box. Note, this is using all CPU cores, so anything that locks things to a single core will be ~1/4 of these. ~2.1Gbps

[screenshot: UDP test, ~2.1Gbps]

TCP tests about 3.8Gbps. Same stipulations as above. I think the TCP test itself on this particular hardware runs better than the UDP test. I suspect iperf would correct the UDP/TCP discrepancy.

[screenshot: TCP test, ~3.8Gbps]

This is a Marvell Armada CPU (4 cores, 1.4GHz). The Annapurna models in the CCR2xxx series are even faster (16 cores, 2GHz) and, I believe, have better IPC.

queue tree cannot do interface matching; it hangs off the global matcher (interface agnostic) and relies on packet marks for the child queues.

syadnom avatar Oct 09 '22 18:10 syadnom

I am not sure if we are talking past each other or not. The simple queues feature leverages tbf + cake or fq_codel to do the shaping inbound, outbound, or both. An interface queue can shape outbound only (via cake's bandwidth parameter), but not inbound. In general, I prefer a world where the cpe shapes outbound -> ISP and the ISP shapes inbound -> CPE, as that does the bottleneck detection and the smartest drop, ack-drop, and rescheduling possible.

However, historically, since the ISPs were slow to move, we saw the rise of middleboxes like preseem and now libreqos (doing it both ways), and also individuals doing it on their own routers, shaping inbound as well. It's only a few commands in linux:

ip link add name SQM_IFB_050ec type ifb
tc qdisc replace dev SQM_IFB_050ec root cake bandwidth whatever
# an ingress qdisc on the real interface is needed to hook the redirect
tc qdisc add dev $IFACE handle ffff: ingress
tc filter add dev $IFACE parent ffff: protocol all prio 10 u32 \
        match u32 0 0 flowid 1:1 action mirred egress redirect dev SQM_IFB_050ec
ip link set dev SQM_IFB_050ec up

The deficit-style shaper in cake that we use is more efficient than tbf in most circumstances and, most importantly, never bursts. So if somehow mikrotik could support this in the interface queues, it would be a win. As it stands, I'm fairly content to just shape at the cpe outbound using the bandwidth parameter.
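
(To illustrate the comparison, with eth0 and the rates as placeholders: cake's built-in shaper is a one-liner, while a token-bucket equivalent bursts by design:)

```bash
# cake's deficit shaper: one line, no bursts
tc qdisc replace dev eth0 root cake bandwidth 50Mbit

# classic token-bucket shaping: bursts up to the bucket size
tc qdisc replace dev eth0 root tbf rate 50mbit burst 32kb latency 400ms
```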

cake is also intensely programmable via tc filters.

dtaht avatar Oct 09 '22 18:10 dtaht

The rb5009 looks really attractive! The ccr2xxx even more so... Btw, I am unsure that a single iperf flow on a single box can crack 4Gbit in the first place, due to running out of local buffer space on either tx or, more likely, rx. Hit it with 4 or more flows?

A fifo, no shaping, can do what on this hardware? An interface queue of fq_codel? Cake with no shaping? Cake shaped to 2Gbit? Cake shaped via the token bucket?

I of course am a really big fan of flent, especially the rrul test, which uses netperf. Shoulda ported it to iperf, too.
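
(For anyone following along, this is the standard invocation from the flent docs, with the server name as a placeholder:)

```bash
# run the rrul test for 60s against a netperf server and save a plot
flent rrul -p all_scaled -l 60 -H netperf.example.com \
      -t "rb5009 cake 2Gbit" -o rrul.png
```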

token buckets do have the advantage of an easy offload to hardware, which I think mikrotik is doing in many cases.

dtaht avatar Oct 09 '22 18:10 dtaht

(and thx very much for filing the bug report! Does it have a number? I can go be a pest elsewhere...)

dtaht avatar Oct 09 '22 18:10 dtaht

mikrotik report SUP-94551

for the 'interface' shaping on mikrotik: that presents as a bi-directional shaper, so it must not be 'interface' in the same sense.

pfifo with a 500 packet buffer can do 4.3Gbps UDP one way, 4.8Gbps TCP one way. ~97% CPU.

Cake with no limits: about 4.4Gbps. Cake at 2Gbps: ~70% CPU. A top-level 'unlimited' fifo feeding a child cake shaper gives the same results for the single-stream test. The fifo wide open with no shaper: TCP 5.1Gbps, UDP 8.7Gbps.

Keep in mind that, according to mikrotik's not-so-great task manager, my bandwidth test is using up ~6% of the CPU. Also, UDP wide open with the test running at 8.7Gbps is ~80% CPU. There is more headroom here if you take the bandwidth generator/receiver off-device. Also, mikrotik's bandwidth test is pretty primitive; I wouldn't count on these numbers with any precision. That said, for a small operator with 1-2Gbps aggregate, this hardware and a well-designed queue tree is pretty legit.

syadnom avatar Oct 09 '22 23:10 syadnom

I poked into the mvpp2 switch driver. No BQL, full support for XDP; a strong candidate for openwrt + XDP + BQL (6 lines of new code). It might be able to push line rate.

dtaht avatar Oct 10 '22 01:10 dtaht

I don't know that openwrt has been successfully run on these yet; likely just a lack of availability to someone who likes jamming openwrt into various hardware.

syadnom avatar Oct 10 '22 14:10 syadnom

There are multiple active efforts over here: https://forum.openwrt.org/t/add-support-for-mikrotik-rb5009ug/104391/760 - people are wrestling with cpu governors and the port to 5.15 presently, but I expect that to get sorted out as more folk leap on it.

dtaht avatar Oct 10 '22 16:10 dtaht

Coming in here a couple weeks late, but I've been lurking between here and the MikroTik forums while I keep kicking the QoS can down the road. After testing Cake and fq-codel on a few hAP AC2s and AC3s at customer homes, I'm impressed with the results and looking to deploy shaping network-wide.

My LibreQoS box has been in line for a couple of months, my only hold-out being the UISP data being out of sync with reality (lots of MikroTik radios and CPE that I have yet to manually insert). It's twiddling its thumbs waiting for me to make it do something.

But over the weekend, I just upgraded my CCR1036 to 7.6. It handles CGNAT for roughly 500 devices, passing 1-2Gbps all day long and sits upstream of the Libre box. The improvements from 6.47 to 7.6 are enough to drop the CPU load from an average of 2-5% to 0% with the same amount of traffic (2Gbps). That leads me to believe, with its 36 1.4GHz cores, that it could easily handle shaping all of these queues, if we had LibreQoS pushing queue scripts to it instead of running tc locally. (Plus, my Libre box is just a NUC with a Thunderbolt cage for the Intel card...)

Roughly 66% of my customers have routers I've installed that can run RouterOS 7. Similar scripts could be run to deploy shaping on the CPE directly, especially in the upload direction. Even cooler would be polling UISP's radio stats (LTU upload bandwidth in particular) and updating the router's upload max to match.
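
A rough sketch of that polling loop follows; the UISP endpoint path, JSON field, queue name, and fixed download rate are all illustrative assumptions, not a documented recipe.

```bash
#!/bin/bash
# Poll a radio's reported upload capacity from UISP and push it into a
# RouterOS simple queue as the new upload max-limit.
UISP="https://uisp.example.com/nms/api/v2.1"   # hypothetical UISP host
TOKEN="x-auth-token: YOUR_TOKEN"
DEV_ID="aaaa-bbbb-cccc"                        # hypothetical device id

# the field name below is an assumption; inspect the API response first
UP_BPS=$(curl -s -H "$TOKEN" "$UISP/devices/$DEV_ID/statistics" \
         | jq '.interfaces[0].statistics.uplinkCapacity')

# set upload max-limit on the CPE; download left at a fixed placeholder
ssh admin@cpe.example.com \
    "/queue simple set [find name=customer-upload] max-limit=${UP_BPS}/100M"
```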

(Interestingly, we did all this 20 years ago.. it's even patented.)

SirBryan avatar Oct 31 '22 22:10 SirBryan

Did Mikrotik ever manage to get queue trees out from the giant lock (that kept them firmly stuck on 1 CPU, no matter how many you have)? If that's resolved in 7.x, then feeding topologies into RouterOS shouldn't be too bad. In the 6.x line, a big queue tree was a sure way to bring a router to its knees.

thebracket avatar Nov 01 '22 14:11 thebracket

> Did Mikrotik ever manage to get queue trees out from the giant lock (that kept them firmly stuck on 1 CPU, no matter how many you have)? If that's resolved in 7.x, then feeding topologies into RouterOS shouldn't be too bad. In the 6.x line, a big queue tree was a sure way to bring a router to its knees.

Pretty much still stuck there and really... always will be. It's super inefficient to migrate data between CPU cores, so a top-level queue is pretty much stuck this way. This is essentially everyone's problem.

That said, since you do have lots of cores in some of these boxes, what might be nice is to create queue trees for each backhaul to at least spread that load out, then monitor those and dynamically update the parent shaper for each backhaul, with the intent of keeping the primary uplink from getting congested while still getting the most possible data out (a sketch follows below). You might even have multiple 'top level' shapers on a backhaul to handle the tree from the secondary hop as well, so long as there's something monitoring and adjusting the main box.

Ultimately, a ridiculously fast CPU and a single top-level shaper would be best, but I think we're kind of hitting a single-core CPU wall here.
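
Here is a minimal sketch of that monitor-and-adjust loop, assuming ssh access to the router; the interface, queue names, thresholds, and the crude parsing of monitor-traffic output are all illustrative.

```bash
#!/bin/bash
# Shrink a backhaul's parent shaper when the uplink nears saturation,
# grow it back otherwise. Names and values are placeholders.
ROUTER=admin@10.0.0.1
while sleep 10; do
  # read current uplink rx rate in bps (output format varies; verify)
  RATE=$(ssh "$ROUTER" '/interface monitor-traffic sfp-sfpplus1 once' \
         | awk '/rx-bits-per-second/ {gsub(/[^0-9]/,"",$2); print $2}')
  if [ "${RATE:-0}" -gt 9000000000 ]; then
    ssh "$ROUTER" '/queue tree set [find name=backhaul-A] max-limit=8G'
  else
    ssh "$ROUTER" '/queue tree set [find name=backhaul-A] max-limit=9500M'
  fi
done
```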

syadnom avatar Nov 01 '22 14:11 syadnom

> inefficient to migrate data between CPU cores so a top level queue is pretty much stuck this way. This is essentially everyone's problem.

It really doesn't have to be - see the xdp-cpumap-tc project that powers LibreQoS. It basically lets you decide which CPU gets which packet (by IP), and then steers it to the right part of a per-CPU queue tree. Very fast, and pretty flexible for steering your CPU load. :-) Mikrotik could do something like that; RouterOS is basically Linux underneath, so I don't think it would be a stretch for them to use some eBPF magic. It's all open source; they are welcome to join the party!
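
(For the curious, the repo ships a small userspace tool for populating the IP-to-CPU map; the invocation below is from memory and may differ between versions:)

```bash
# pin a subscriber IP to CPU 2; the per-CPU queue tree on that core
# then does the shaping for that address
./xdp_iphash_to_cpu_cmdline --ip 100.64.1.23 --cpu 2
```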

I could definitely see some utility in a system that feeds queue data into Mikrotik routers, though that risks neglecting the best part of LibreQoS (the shaper) and mostly using the integration APIs with a different back-end. I keep musing about having Cake queues on downstream routers, helping reduce bufferbloat along the line (e.g. "Tower X has 65 Mbps of upstream (number chosen at random); use Cake to de-bloat that 65 Mbps and let LibreQoS at the core handle the bigger picture"). I haven't got beyond "I wonder if that would help?"

thebracket avatar Nov 01 '22 14:11 thebracket

xdp-cpumap-tc doesn't eliminate the latency between CPUs and caches. On an intel 11th gen CPU that's just under 30ns of added latency per fetch.

syadnom avatar Nov 01 '22 15:11 syadnom

Obviously, but it elides the giant lock that keeps tc from spreading out on its own. A couple of hundred nanoseconds (even 3000 nanoseconds) is a really small price to pay if it lets you spread the heavy lifting (HTB, cake, etc.) over a large number of cores. It's all about amortizing your costs.

It's also not 30ns per fetch; the first fetch from another core's L1 cache is slow, but after that the data is almost guaranteed to be in the local core's cache. If you maintain locality from assigning the core (in the XDP program), through the TC bpf program, and then the shaper itself, you do far better than that. Otherwise, I wouldn't be timing the XDP programs under load (just under 5gbit/s) as low as 60 ns (and very occasionally as high as 3000 ns), and that includes two slow clock reads and a slow text format/kernel debug pipe output to obtain those numbers. Admittedly, there's a ton of work there to read ahead and avoid pointer chasing where possible.

Comparing that cost versus having to run everything on a single core (while the rest idle), it's pretty obvious which will give you greater overall performance. Scanning the packet headers on the destination core also has a lovely side-effect: it pretty much ensures the packet is in L1/L2 cache on the correct core by the time Cake/HTB runs. (Mikrotik are even halfway there, letting you pin NIC interrupt queues to cores, which can significantly improve performance if you get it right!)

thebracket avatar Nov 01 '22 15:11 thebracket

I don't know that 3ms is actually a small price...

And it really can't just be one fetch, because then you've just moved the processing over to that core. It's a fetch every single time, because you're having to keep data between both CPUs in sync. Further, during that fetch, every core involved is idle.

The point about putting NIC interrupts on specific cores is exactly the point: you get a dramatic increase in performance by not copying data between cores, or a rather dramatic loss if you do...

syadnom avatar Nov 01 '22 15:11 syadnom

3000ns is just 0.003ms though. If we compare that to the increased latency a Mikrotik router will introduce to its forwarded traffic if a CPU core gets choked up by queues, that 0.003ms seems negligible, no?

rchac avatar Nov 01 '22 15:11 rchac

30 ns - nanoseconds. That's 3e-5 ms, a really small number.

thebracket avatar Nov 01 '22 15:11 thebracket

My ambition was to get the RB5009 up on openwrt, with the BQL patches, and to try xdp-cpumap on that. The testing stalled out inconclusively, but in terms of testing everything (BQL, xdp, cake, shaping, etc.), a potential revolution is just a reflash away...

https://forum.openwrt.org/t/add-support-for-mikrotik-rb5009ug/104391/812

dtaht avatar Nov 01 '22 18:11 dtaht

@SirBryan

> But over the weekend, I just upgraded my CCR1036 to 7.6. It handles CGNAT for roughly 500 devices, passing 1-2Gbps all day long and sits upstream of the Libre box. The improvements from 6.47 to 7.6 are enough to drop the CPU load from an average of 2-5% to 0% with the same amount of traffic (2Gbps).

Impressive.

> That leads me to believe, with its 36 1.4GHz cores, that it could easily handle shaping all of these queues, if we had LibreQoS pushing queue scripts to it instead of running tc locally. (Plus, my Libre box is just a NUC with a Thunderbolt cage for the Intel card...)

However, queue trees are very cpu intensive. I'd love merely to know what happens if you slam fq_codel (or cake) onto each of those interfaces running native, and whether it has any observable effect.
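
(A hedged sketch of what that could look like on RouterOS v7, with the queue-type syntax assumed from the v7 docs; verify before deploying:)

```
/queue type add name=fq-codel-iface kind=fq-codel
/queue interface set ether1 queue=fq-codel-iface
```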

> Roughly 66% of my customers have routers I've installed that can run RouterOS 7. Similar scripts could be run to deploy shaping on the CPE directly, especially in the upload direction.

+10. cake was designed primarily (originally) to run on the interface directly with its own shaper, diffserv, ack-filter, and nat awareness. The edge CPE is the right place to stick it, wherever possible. I urge you to start deploying it.
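
(A typical CPE egress line, with the interface and rate as placeholders; all three keywords are standard cake options:)

```bash
# shape upload at the CPE with NAT awareness, diffserv, and ACK filtering
tc qdisc replace dev eth0 root cake bandwidth 20Mbit diffserv4 nat ack-filter
```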

> Even more cool would be polling UISP's radio stats (LTU upload bandwidth in particular) and updating the router's upload max to match.

While this is a good thing, I've punted your requests to the v1.4 release. The v1.3 release hits beta Nov 15th; would it be possible for you to test it on a small portion of your existing network?

dtaht avatar Nov 04 '22 16:11 dtaht

The CCR10xx series has a lot of really poor general-purpose CPUs: basically 20-year-old cores with a network processor on top (TILE/Tilera). Great for routing, terrible for shaping; well under 1Gbps shaping with cake. The hAP AC2 has a vastly superior chip for shaping, at least twice as fast if not more.

The CCR2xxx series is much, much better: an Annapurna ARM CPU, and often a Marvell hardware routing chip onboard as well.

syadnom avatar Nov 04 '22 17:11 syadnom