
exporter: fix flaky Test_rateLimitExport with polling

Open calghar opened this issue 1 month ago • 1 comments

Summary

This PR fixes the flaky Test_rateLimitExport test by replacing the fixed sleep duration with a polling-based synchronization mechanism, addressing the timing race condition reported in #2789.

Related Issue

Fixes #2789

Root Cause

The test used a fixed 200ms sleep, which allowed multiple rate limiter ticker intervals (50ms each) to elapse during the wait period. This created a timing window where:

  • Multiple tickers could emit rate-limit-info messages if events were still processing
  • A race condition at ticker boundaries could cause off-by-one errors in event counts
  • No synchronization existed between "events sent" and "events fully processed"

Proposed Changes

  • Added countEvents() helper function: Non-blocking function to count events and rate-limit-info messages without assertions
  • Replaced fixed sleep with polling: Poll every 10ms until the expected number of events and rate-limit-info messages has been received
  • Added timeout with clear diagnostics: 500ms timeout (2.5× original sleep) with descriptive error message showing actual vs. expected counts

Testing Performed

  • Test passes 20 consecutive runs without failures (verified via Docker with golang:1.25)
  • No performance degradation: the test should typically complete faster than the original, since polling stops as soon as the expected counts are observed
  • Deterministic behavior regardless of system timing or load

Backward Compatibility

No breaking changes. The fix only modifies the test implementation, not the rate limiter functionality itself.

Changelog


calghar avatar Oct 31 '25 13:10 calghar

This PR fixes a flaky test with no user-facing changes. Could a maintainer please add the release-note/misc label? Thanks!

calghar avatar Oct 31 '25 13:10 calghar