Bouncer metrics add significant load on OpenWrt router
What happened?
After updating the bouncer to version 0.0.29-rc3 on my OpenWrt router (BananaPi R3), I noticed that the bouncer adds significant load to the router when metrics are enabled. The router normally runs at very low load (under 4% CPU; it has a fairly powerful processor), but with metrics enabled the load rises by 10 - 20% on all 4 CPUs.
What did you expect to happen?
I'd expect collecting metrics not to add this much additional load to the device.
How can we reproduce it (as minimally and precisely as possible)?
Install OpenWrt on BPi R3 with luci-app-crowdsec-bouncer and enable metrics.
Anything else we need to know?
Most routers have limited processing power, but metrics are important. Even before metrics were enabled in the bouncer, the firewall rules already counted blocked packets and bytes, so the load must be caused by collecting some other value or by the metrics-collection process itself.
I suggest providing a setting for a limited set of metrics that only includes the number of dropped packets and bytes. None of the Go runtime internals are really needed during normal operation.
The number of banned IPs should already be available from the LAPI. I don't know how this number is calculated or whether it is the cause of the load (counting the IPs in the set?), but this should be tested; if counting the elements in the sets is not the culprit, this value could also be added to the limited set of metrics.
Also, it seems the metrics are collected continuously.
Maybe something like this could be a solution:
- disabling metrics disables the continuous collection of metrics (as it is now)
- provide the /metrics endpoint nevertheless
- if /metrics is called, collect the metrics on demand (maybe with only the limited set available if metrics are disabled), and/or
- add an optional parameter to the /metrics endpoint defining the set of metrics to collect (limited, full)
This would enable Prometheus to collect only the needed metrics and prevent unnecessary load on the device. By choosing the scrape interval on the caller's side, the load can be reduced further.
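For illustration only, here is a minimal Go sketch of that on-demand idea using the Prometheus client library: the (potentially expensive) firewall read happens inside Collect(), i.e. only when /metrics is actually scraped. The metric names, the listen address, and the readFirewallCounters helper are hypothetical placeholders, not the bouncer's actual code.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// onDemandCollector reads the firewall counters only when Prometheus
// actually scrapes /metrics, instead of on a background ticker.
type onDemandCollector struct {
	droppedPackets *prometheus.Desc
	droppedBytes   *prometheus.Desc
}

func newOnDemandCollector() *onDemandCollector {
	return &onDemandCollector{
		droppedPackets: prometheus.NewDesc("fw_bouncer_dropped_packets", "Number of dropped packets", nil, nil),
		droppedBytes:   prometheus.NewDesc("fw_bouncer_dropped_bytes", "Number of dropped bytes", nil, nil),
	}
}

func (c *onDemandCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.droppedPackets
	ch <- c.droppedBytes
}

// Collect is called once per scrape; the expensive firewall read happens
// here, so an exporter that is never scraped does (almost) no work.
func (c *onDemandCollector) Collect(ch chan<- prometheus.Metric) {
	packets, bytes := readFirewallCounters() // hypothetical helper
	ch <- prometheus.MustNewConstMetric(c.droppedPackets, prometheus.CounterValue, packets)
	ch <- prometheus.MustNewConstMetric(c.droppedBytes, prometheus.CounterValue, bytes)
}

// readFirewallCounters stands in for whatever reads the nftables/iptables
// counters (e.g. parsing iptables-save output).
func readFirewallCounters() (packets, bytes float64) {
	return 0, 0
}

func main() {
	prometheus.MustRegister(newOnDemandCollector())
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:60601", nil))
}
```

With a setup like this, the scrape interval configured on the Prometheus side directly controls how often the counters are read.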
version
remediation component version:
$ crowdsec-firewall-bouncer --version
0.0.29-rc3
crowdsec version
crowdsec version:
$ crowdsec --version
1.6.4
OS version
# On Linux:
$ cat /etc/os-release
OpenWrt 23.05.4
$ uname -a
Linux BPI-R3-eth1 5.15.162 #0 SMP Mon Jul 15 22:14:18 2024 aarch64 GNU/Linux
Hello,
We are reworking the way we collect metrics and adding more metrics in this PR: https://github.com/crowdsecurity/cs-firewall-bouncer/pull/365 (the goal is to also expose those metrics in cscli metrics and to have more granular data about which decision source blocks what).
We have optimized the way we collect metrics for both nftables (no more calls to the nft binary) and iptables (a single call to iptables-save). Would you mind trying the PR to see if you still see such an impact?
Hi @blotus, thank you for the response. Unfortunately I don't have a Go dev environment running, nor any experience with Go. But I see that you have already added an 'only compute metrics when requested' commit to the PR.
Hello,
I've just merged #365, and it should be released this week.
There are some significant changes in how we collect metrics:
- Reduced ticker usage: metrics are now only collected when Prometheus makes a request to the endpoint, plus every ~20 minutes so we can send some statistics for crowdsec to display in `cscli` and the web console.
- In `iptables` (or `ipset`) mode, the counter values are now fetched with `iptables-save`, which is faster and easier to parse (a rough sketch of this approach is shown after this list).
- In `nftables` mode, we now use embedded counters in the rules we add, which removes the need to call the `nft` binary and is much faster.
- Because of those changes, metrics are always collected at least once every 20 minutes even if the Prometheus endpoint is disabled (from our testing, it takes about 100-200 ms to collect the metrics with more than 100k banned IPs; this will of course vary with the CPU, but it should be almost invisible).
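For readers curious what the iptables-save approach looks like, here is a rough Go sketch (not the code from #365; the chain name and the matching logic are assumptions): it runs `iptables-save -c` once and sums the `[packets:bytes]` counters of the rules in a given chain.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// readDropCounters runs `iptables-save -c` once and sums the [packets:bytes]
// counters of every rule in the given chain. The chain name and the matching
// logic are illustrative only, not the bouncer's actual implementation.
func readDropCounters(chain string) (packets, droppedBytes uint64, err error) {
	out, err := exec.Command("iptables-save", "-c").Output()
	if err != nil {
		return 0, 0, err
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		line := scanner.Text()
		// Counted rules look like: [123:4567] -A CHAIN ... -j DROP
		if !strings.HasPrefix(line, "[") || !strings.Contains(line, " -A "+chain+" ") {
			continue
		}
		var p, b uint64
		if _, err := fmt.Sscanf(line, "[%d:%d]", &p, &b); err != nil {
			continue
		}
		packets += p
		droppedBytes += b
	}
	return packets, droppedBytes, scanner.Err()
}

func main() {
	// "crowdsec-chain" is a hypothetical chain name.
	p, b, err := readDropCounters("crowdsec-chain")
	if err != nil {
		fmt.Println("error reading counters:", err)
		return
	}
	fmt.Printf("dropped: %d packets, %d bytes\n", p, b)
}
```

A single exec of `iptables-save` plus text parsing is much cheaper than invoking `iptables -L -v` per rule, which is presumably why it was chosen.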
I'm going to close the issue, but feel free to reopen it if you still see huge CPU usage after the release.