tetragon
tetragon copied to clipboard
Ratelimit metrics are not working as expected in v1.3.0
What happened?
We upgraded tetragon to v1.3.0.
The tetragon_export_ratelimit_events_dropped_total (which got renamed from tetragon_ratelimit_dropped_total in this release) for tracking number of rate-limited events does not seem to work as expected. The metric exists but results in 0 value.
Log output included below.
Fix: I haven't looked into the codebase yet but I suspect some part of code got missed while renaming the metric.
Impact: We don't have visibility into how well our rate-limit filters are working and that makes it hard to tune the policy correctly.
Tetragon Version
v1.3.0
Kernel Version
6.8.0-1016-aws
Kubernetes Version
Server version 1.30
Bugtool
No response
Relevant log output
tetragon_export_ratelimit_events_dropped_total{account="<snip>", cluster="<snip>", cluster_group="<snip>", endpoint="metrics", instance="<snip>", job="tetragon", namespace="<snip>", node="<snip>", pod="<snip>", prometheus="<snip>", provider="<snip>", service="tetragon", tetragon_pod="<snip>", vpc="<snip>", zone="<snip>"} - 0
Anything else?
No response
Hey, so this is the commit that did the rename 184825310c3a ("exporter: Rename ratelimit drops counter"), and here's the full PR that includes some other changes https://github.com/cilium/tetragon/pull/2890 around that rename. This commit moves the place of incrementation but very slightly: 52310935bc7f ("Move ratelimitmetrics inside pkg/exporter").
Here's an extract from the diff: it doesn't fundamentally change the code, instead of being called in the Drop() function itself, it's now called from the caller, after the drop occured.
diff --git a/pkg/exporter/exporter.go b/pkg/exporter/exporter.go
index bc8bf0729..2d21a0e70 100644
--- a/pkg/exporter/exporter.go
+++ b/pkg/exporter/exporter.go
@@ -56,6 +56,7 @@ func (e *Exporter) Start() error {
func (e *Exporter) Send(event *tetragon.GetEventsResponse) error {
if e.rateLimiter != nil && !e.rateLimiter.Allow() {
e.rateLimiter.Drop()
+ rateLimitDropped.Inc()
return nil
}
diff --git a/pkg/ratelimit/ratelimit.go b/pkg/ratelimit/ratelimit.go
index 404395225..52c5512a4 100644
--- a/pkg/ratelimit/ratelimit.go
+++ b/pkg/ratelimit/ratelimit.go
@@ -78,5 +77,4 @@ func (r *RateLimiter) reportRateLimitInfo(encoder encoder.EventEncoder) {
func (r *RateLimiter) Drop() {
atomic.AddUint64(&r.dropped, 1)
- ratelimitmetrics.RateLimitDropped.Inc()
}
My questions would be: did this ever work for you (seems like yes since you mention an upgrade)? If yes, for which version/commit? And how do you test this so we can try to reproduce on our side? Thanks!
We were on 1.1.2 previously this week, and turns out this metric was also 0 at that point as well. Seems as if the metric Inc() doesn't actually get called by the agent when a ratelimit is triggered. We do see very nice spikes when the ratelimit kicks in, but tetragon_export_ratelimit_events_dropped_total is always 0 from our scrapes.
IIRC, I don't think we were using ratelimit filters while on 1.1.2 so hard to say if the metric worked or not.
Hey @kevsecurity did you observe this metric to work?
Hey @kevsecurity did you observe this metric to work?
Not looked at it! Maybe @lambdanis knows about it?
@pratiklotia @sp3nx0r Did you observe rate limiting itself working as expected? I'm not sure if this metric worked correctly before, I'll check.
@lambdanis When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down. It goes further down when the ratelimit window is wider (say 1m instead of 10s). That indicates that ratelimit is working as expected.
Event graphs sawtooth with the ratelimiting so that's definitely working. But the ratelimiting metric remains 0 (it's been 0 "forever" it seems based on our metric retention)
When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down.
I see, the gotcha here is that Tetragon provides two different rate limiting mechanisms. You're using rate limits embedded in TracingPolicy, but this is not what tetragon_export_ratelimit_events_dropped_total metric measures. The metric counts events rate limited on export - this is configured with export-rate-limit flag or tetragon.exportRateLimit Helm value. That's why the metric got renamed actually - to make it tied to export, not policy.
There is no metric reporting events rate limited by a policy at the moment I'm afraid. It definitely would be useful, so feel free to open an issue to add it, or a PR.
Given lambdanis last reply, I'll close this issue as it seems to be solved.