[Filebeat]: aws-s3 input - Add Lexicographical Ordering Mode

Open andrewkroh opened this issue 3 weeks ago • 2 comments

Problem

The current S3 bucket polling mode works universally but requires tracking every processed object, which becomes challenging for the most common AWS log sources (CloudTrail, VPC Flow Logs, ALB/NLB Access Logs, S3 Access Logs, CloudFront Logs). These services generate logs continuously, and as buckets accumulate objects over time, users experience:

  • Growing memory usage and slower performance
  • Increasing API costs as more objects accumulate
  • More complex state management

Opportunity

The most common AWS log sources write to S3 with lexicographically-ordered object keys due to their timestamp-based naming conventions (e.g., YYYY/MM/DD paths or YYYYMMDDTHHmmZ timestamps). This ordering pattern can be leveraged to make monitoring these sources simpler and more efficient.

The S3 ListObjectsV2 API's start-after parameter returns only keys that sort after a given key, which enables incremental polling well suited to these common AWS log sources:

  • List only new objects since the last poll (no need to scan entire bucket)
  • Use a small bounded buffer instead of tracking all objects
  • Reduce API calls dramatically (typically 1-10 per poll vs hundreds/thousands)
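To illustrate the idea, here is a minimal sketch of start-after semantics using an in-memory sorted list as a stand-in for the bucket. The helper `list_objects_after` and the key names are hypothetical; a real implementation would call ListObjectsV2 through an AWS SDK instead. It assumes ASCII keys, so Python string comparison matches the order S3 returns.

```python
from bisect import bisect_right

def list_objects_after(sorted_keys, start_after, max_keys=1000):
    """Hypothetical stand-in for ListObjectsV2: return up to max_keys
    keys that sort strictly after start_after, in ascending order."""
    start = bisect_right(sorted_keys, start_after)
    return sorted_keys[start:start + max_keys]

# Timestamp-based keys (made up for this example) sort chronologically.
bucket = [
    "logs/2025/12/03/20251203T2350Z_a.gz",
    "logs/2025/12/03/20251203T2355Z_b.gz",
    "logs/2025/12/04/20251204T0000Z_c.gz",
    "logs/2025/12/04/20251204T0005Z_d.gz",
]

last_seen = bucket[1]                       # newest key from the previous poll
new_keys = list_objects_after(bucket, last_seen)
print(new_keys)                             # only the two 2025/12/04 keys
```

Because the listing starts right after the last seen key, the cost of a poll depends on how many objects are new, not on how many objects the bucket holds.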

Proposed Solution

Configuration

filebeat.inputs:
- type: aws-s3
  bucket_arn: arn:aws:s3:::my-cloudtrail-bucket
  bucket_list_prefix: AWSLogs/123456789012/CloudTrail/
  lexicographical_ordering: true
  lexicographical_lookback_keys: 100  # Required: buffer size for out-of-order protection

Technical Requirements

  1. Lookback Buffer (Mandatory): Maintain a bounded buffer of the N most recently processed keys (default: 100). The start-after parameter is set to the lexicographically smallest key in the buffer, creating a sliding window that catches late-arriving objects.

  2. State Storage: Store only the lookback buffer (~20 KB for default 100 keys) instead of all processed objects, so memory is bounded by the buffer size and stays constant (O(1)) regardless of total objects processed.

  3. Out-of-Order Handling: The lookback buffer ensures objects that arrive out of lexicographical order (due to late delivery, clock skew, or retries) are still detected and processed.

  4. API Usage: Use ListObjectsV2 with the start-after parameter set to the lexicographically smallest key in the lookback buffer. The AWS SDK's paginator automatically handles the continuation-token when a listing spans more than 1,000 objects.
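The requirements above can be sketched roughly as follows. This is not the proposed Filebeat implementation: `LookbackBuffer` and `poll` are hypothetical names, the bucket is simulated with a sorted list, and the sketch assumes the buffer evicts its lexicographically smallest key when full. That is one reading of "most recently processed", chosen because it guarantees every processed key above start-after is still in the buffer, so nothing inside the window is processed twice.

```python
from bisect import bisect_left, bisect_right, insort

class LookbackBuffer:
    """Holds at most `size` processed keys in ascending order (hypothetical)."""

    def __init__(self, size=100):
        self.size = size
        self.keys = []

    def __contains__(self, key):
        i = bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def start_after(self):
        # Smallest key in the buffer; empty string means "list from the start".
        return self.keys[0] if self.keys else ""

    def add(self, key):
        insort(self.keys, key)
        if len(self.keys) > self.size:
            del self.keys[0]  # assumption: evict the smallest key

def poll(bucket, buf):
    """List keys after buf.start_after() and return the unprocessed ones."""
    start = bisect_right(bucket, buf.start_after())
    new_keys = [k for k in bucket[start:] if k not in buf]
    for k in new_keys:
        buf.add(k)
    return new_keys

bucket = ["k01", "k02", "k04", "k05"]   # simulated bucket, sorted
buf = LookbackBuffer(size=3)
first = poll(bucket, buf)               # initial poll sees all four keys

insort(bucket, "k03")                   # late arrival inside the window
insort(bucket, "k06")                   # ordinary new object
second = poll(bucket, buf)              # picks up both k03 and k06

insort(bucket, "k00")                   # arrives below the window
third = poll(bucket, buf)               # k00 is not listed, so it is missed
```

The last poll shows the trade-off: an object whose key sorts before the smallest key in the buffer is never listed again, which is why the buffer size bounds how late an object can arrive.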

Benefits

This mode makes it easier and more reliable to monitor common AWS log sources:

  • Simpler state management: Small bounded buffer (~20 KB) instead of tracking all objects
  • Faster polling: Typically completes in 1-2 seconds regardless of bucket size
  • Lower costs: ~99% reduction in API costs for long-running log buckets
  • Better reliability: Consistent performance as buckets grow over time
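As a rough back-of-the-envelope check on the API cost claim, the sketch below counts list requests per poll. ListObjectsV2 returns at most 1,000 keys per request; the object counts are hypothetical and are not measurements from the issue.

```python
import math

PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per request

def list_requests(num_keys):
    """Requests needed to list num_keys keys (at least one per poll)."""
    return max(1, math.ceil(num_keys / PAGE_SIZE))

# Hypothetical long-running bucket.
total_objects = 1_000_000
new_objects_per_poll = 500
lookback_keys = 100

full_scan = list_requests(total_objects)                         # 1000
incremental = list_requests(lookback_keys + new_objects_per_poll)  # 1
reduction = 1 - incremental / full_scan                          # 0.999

print(full_scan, incremental, f"{reduction:.1%}")
```

Under these assumed numbers the incremental poll needs a single request where a full listing needs a thousand. The actual saving depends on bucket size and how many objects arrive between polls.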

Limitations

  • Only suitable for sources with lexicographically-ordered keys (timestamp-based naming)
  • Not suitable for custom applications with non-ordered key naming
  • Lookback buffer size determines how "late" an object can arrive and still be caught
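A quick way to see which naming schemes qualify is to check whether keys listed in creation order are also in ascending string order. The helper `is_lexicographically_ordered` and the key names below are hypothetical, for illustration only.

```python
def is_lexicographically_ordered(keys_in_creation_order):
    """True if every key sorts strictly after the key created before it."""
    return all(a < b for a, b in zip(keys_in_creation_order,
                                     keys_in_creation_order[1:]))

# Zero-padded date paths: creation order matches string order.
padded = [
    "logs/2025/01/09/a.gz",
    "logs/2025/01/10/a.gz",
    "logs/2025/02/01/a.gz",
]

# Unpadded dates: "9" sorts after "10", so the order breaks.
unpadded = [
    "logs/2025/1/9/a.gz",
    "logs/2025/1/10/a.gz",
    "logs/2025/2/1/a.gz",
]

print(is_lexicographically_ordered(padded))    # True
print(is_lexicographically_ordered(unpadded))  # False
```

Keys built from random identifiers or unpadded numbers fail this check, and objects written under such keys could land before the start-after position and never be listed.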

Related

  • https://gist.github.com/andrewkroh/5e8f5f3eca6efcd5b133e74aa8417180

andrewkroh, Dec 04 '25 17:12

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

elasticmachine, Dec 08 '25 09:12

@andrewkroh, are we going to enable AWS S3 in agentless (whenever lexicographical_ordering: true) since we can now have simpler state management?

kcreddy, Dec 11 '25 13:12