
Ratelimit metrics are not working as expected in v1.3.0

Open pratiklotia opened this issue 9 months ago • 9 comments

What happened?

We upgraded tetragon to v1.3.0.

The tetragon_export_ratelimit_events_dropped_total metric (renamed from tetragon_ratelimit_dropped_total in this release), which tracks the number of rate-limited events, does not work as expected: the metric exists but always reports 0.

Log output included below.

Fix: I haven't looked into the codebase yet, but I suspect some part of the code was missed during the metric rename.

Impact: We don't have visibility into how well our rate-limit filters are working and that makes it hard to tune the policy correctly.

Tetragon Version

v1.3.0

Kernel Version

6.8.0-1016-aws

Kubernetes Version

Server version 1.30

Bugtool

No response

Relevant log output

tetragon_export_ratelimit_events_dropped_total{account="<snip>", cluster="<snip>", cluster_group="<snip>", endpoint="metrics", instance="<snip>", job="tetragon", namespace="<snip>", node="<snip>", pod="<snip>", prometheus="<snip>", provider="<snip>", service="tetragon", tetragon_pod="<snip>", vpc="<snip>", zone="<snip>"} -  0

Anything else?

No response

pratiklotia avatar Feb 27 '25 21:02 pratiklotia

Hey, so this is the commit that did the rename: 184825310c3a ("exporter: Rename ratelimit drops counter"), and here's the full PR that includes some other changes around that rename: https://github.com/cilium/tetragon/pull/2890. This commit also moved where the increment happens, though only slightly: 52310935bc7f ("Move ratelimitmetrics inside pkg/exporter").

Here's an extract from the diff. It doesn't fundamentally change the code: instead of the counter being incremented inside the Drop() function itself, it's now incremented by the caller, after the drop occurred.

diff --git a/pkg/exporter/exporter.go b/pkg/exporter/exporter.go
index bc8bf0729..2d21a0e70 100644
--- a/pkg/exporter/exporter.go
+++ b/pkg/exporter/exporter.go
@@ -56,6 +56,7 @@ func (e *Exporter) Start() error {
 func (e *Exporter) Send(event *tetragon.GetEventsResponse) error {
        if e.rateLimiter != nil && !e.rateLimiter.Allow() {
                e.rateLimiter.Drop()
+               rateLimitDropped.Inc()
                return nil
        }
diff --git a/pkg/ratelimit/ratelimit.go b/pkg/ratelimit/ratelimit.go
index 404395225..52c5512a4 100644
--- a/pkg/ratelimit/ratelimit.go
+++ b/pkg/ratelimit/ratelimit.go
@@ -78,5 +77,4 @@ func (r *RateLimiter) reportRateLimitInfo(encoder encoder.EventEncoder) {

 func (r *RateLimiter) Drop() {
        atomic.AddUint64(&r.dropped, 1)
-       ratelimitmetrics.RateLimitDropped.Inc()
 }

My questions would be: did this ever work for you (it seems so, since you mention an upgrade)? If yes, with which version/commit? And how do you test this, so we can try to reproduce on our side? Thanks!

mtardy avatar Feb 28 '25 10:02 mtardy

We were on 1.1.2 until earlier this week, and it turns out this metric was 0 at that point as well. It seems the metric's Inc() never actually gets called by the agent when a rate limit is triggered. We do see very clear spikes when the rate limit kicks in, but tetragon_export_ratelimit_events_dropped_total is always 0 in our scrapes.


sp3nx0r avatar Feb 28 '25 21:02 sp3nx0r

IIRC, we weren't using ratelimit filters while on 1.1.2, so it's hard to say whether the metric worked then or not.

pratiklotia avatar Mar 01 '25 21:03 pratiklotia

Hey @kevsecurity did you observe this metric to work?

mtardy avatar Mar 03 '25 09:03 mtardy

Hey @kevsecurity did you observe this metric to work?

Not looked at it! Maybe @lambdanis knows about it?

kevsecurity avatar Mar 03 '25 11:03 kevsecurity

@pratiklotia @sp3nx0r Did you observe rate limiting itself working as expected? I'm not sure if this metric worked correctly before, I'll check.

ghost avatar Mar 04 '25 18:03 ghost

@lambdanis When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down, and it goes down further when the ratelimit window is wider (say 1m instead of 10s). That indicates that rate limiting itself is working as expected.

pratiklotia avatar Mar 05 '25 15:03 pratiklotia

Event graphs show a sawtooth pattern matching the rate limiting, so that part is definitely working. But the ratelimit metric remains 0 (and it's been 0 "forever", as far back as our metric retention goes).


sp3nx0r avatar Mar 05 '25 16:03 sp3nx0r

When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down.

I see, the gotcha here is that Tetragon provides two different rate-limiting mechanisms. You're using rate limits embedded in a TracingPolicy, but that is not what the tetragon_export_ratelimit_events_dropped_total metric measures. The metric counts events rate limited on export, which is configured with the export-rate-limit flag or the tetragon.exportRateLimit Helm value. That's actually why the metric was renamed: to make clear it's tied to export, not to policies.
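For illustration, here's roughly what the two mechanisms look like in configuration. The exportRateLimit Helm value and export-rate-limit flag are named in this thread; the rateLimit field on a Post action is written from my recollection of the Tetragon docs, so treat the exact spelling and semantics as assumptions:

```yaml
# 1. Per-policy rate limiting inside a TracingPolicy (what you're using).
#    Drops here are NOT counted by
#    tetragon_export_ratelimit_events_dropped_total:
#
#      matchActions:
#      - action: Post
#        rateLimit: "1m"   # field name per Tetragon docs; assumption

# 2. Export-path rate limiting (what the metric counts), via Helm values.
#    Equivalent daemon flag: --export-rate-limit
tetragon:
  exportRateLimit: 100    # limit on exported events; assumed per-minute
```

So with only per-policy rate limits configured and no export rate limit set, a 0 value for this metric is the expected reading.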

There is no metric reporting events rate limited by a policy at the moment, I'm afraid. It would definitely be useful, so feel free to open an issue to add it, or a PR.

ghost avatar Mar 06 '25 10:03 ghost

Given @lambdanis's last reply, I'll close this issue, as it seems to be resolved.

mtardy avatar Jul 15 '25 16:07 mtardy