[Filebeat]: aws-s3 input - Add Lexicographical Ordering Mode

Problem

The current S3 bucket polling mode works with any key layout, but it has to track every processed object. That becomes costly for the most common AWS log sources (CloudTrail, VPC Flow Logs, ALB/NLB Access Logs, S3 Access Logs, CloudFront Logs). These services write logs continuously, and as objects accumulate in the bucket, users experience:

Growing memory usage and slower performance
Increasing API costs as more objects accumulate
More complex state management

Opportunity

The most common AWS log sources write to S3 with lexicographically ordered object keys because of their timestamp-based naming conventions (e.g., YYYY/MM/DD paths or YYYYMMDDTHHmmZ timestamps). Since these timestamps are fixed-width and zero-padded, sorting the keys as strings matches their chronological order. This ordering can be leveraged to make monitoring these sources simpler and more efficient.
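As a quick illustration of why timestamp-based keys sort chronologically, here is a minimal sketch. The key layout in `cloudtrail_style_key` is a simplified, hypothetical pattern modeled on the formats mentioned above, not the exact naming of any AWS service.

```python
from datetime import datetime, timedelta, timezone


def cloudtrail_style_key(ts: datetime) -> str:
    # Hypothetical layout: YYYY/MM/DD path plus a compact YYYYMMDDTHHmmZ timestamp.
    return ts.strftime(
        "AWSLogs/123456789012/CloudTrail/us-east-1/%Y/%m/%d/%Y%m%dT%H%MZ.json.gz"
    )


# Six objects written five minutes apart, crossing a year boundary.
start = datetime(2024, 12, 31, 23, 50, tzinfo=timezone.utc)
timestamps = [start + timedelta(minutes=5 * i) for i in range(6)]
keys = [cloudtrail_style_key(ts) for ts in timestamps]

# Sorting the keys as plain strings gives the same order as sorting by time,
# because every timestamp field is fixed-width and zero-padded.
keys_sorted = sorted(keys)
```

Here `keys_sorted` equals `keys`, even though the sequence crosses from 2024/12/31 into 2025/01/01.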

The start-after parameter of the S3 ListObjectsV2 API makes incremental polling possible for these sources:

List only new objects since the last poll (no need to scan entire bucket)
Use a small bounded buffer instead of tracking all objects
Reduce API calls dramatically (typically 1-10 per poll vs hundreds/thousands)
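The incremental listing idea can be sketched without touching AWS. In the example below, `list_objects_after` is a hypothetical in-memory stand-in for a single ListObjectsV2 page: it returns keys strictly greater than `start_after`, in ascending order, up to `max_keys`. The bucket contents are made up for illustration.

```python
import bisect


def list_objects_after(sorted_keys, start_after="", max_keys=1000):
    """Stand-in for one ListObjectsV2 page (assumes sorted_keys is sorted)."""
    # bisect_right skips start_after itself, so listing begins strictly after it.
    i = bisect.bisect_right(sorted_keys, start_after)
    return sorted_keys[i:i + max_keys]


# Hypothetical bucket: one object per hour for a single day.
bucket = sorted(f"logs/2024/01/01/{h:02d}00Z.log" for h in range(24))

first_poll = list_objects_after(bucket)   # initial poll sees everything
checkpoint = first_poll[-1]               # remember the largest key seen

bucket.append("logs/2024/01/02/0000Z.log")  # a new object arrives later

# The next poll lists only keys after the checkpoint, not the whole bucket.
second_poll = list_objects_after(bucket, start_after=checkpoint)
```

The second poll returns only the newly written key. A single checkpoint like this would miss late arrivals, which is what the lookback buffer in the proposal is meant to address.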

Proposed Solution

Configuration

filebeat.inputs:
- type: aws-s3
  bucket_arn: arn:aws:s3:::my-cloudtrail-bucket
  bucket_list_prefix: AWSLogs/123456789012/CloudTrail/
  lexicographical_ordering: true
  lexicographical_lookback_keys: 100  # lookback buffer size for out-of-order protection (default: 100)

Technical Requirements

Lookback Buffer (Mandatory): Maintain a bounded buffer of the N most recently processed keys (default: 100). Each poll sets start-after to the lexicographically smallest key in the buffer, creating a sliding window that catches late-arriving objects. Keys inside the window that are already in the buffer are skipped rather than reprocessed.
State Storage: Store only the lookback buffer (~20 KB for default 100 keys) instead of all processed objects, providing O(1) bounded memory regardless of total objects processed.
Out-of-Order Handling: The lookback buffer ensures objects that arrive out of lexicographical order (due to late delivery, clock skew, or retries) are still detected and processed.
API Usage: Call ListObjectsV2 with the start-after parameter set to the lexicographically smallest key in the lookback buffer. ListObjectsV2 returns at most 1,000 keys per request; the AWS SDK's paginator handles the continuation-token automatically when a poll returns more than that.
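The requirements above can be sketched as a small in-memory model. This is not the Filebeat implementation: `LookbackPoller` and its methods are hypothetical names, the bucket is a plain sorted list, and the buffer keeps the N largest processed keys as a simplification of "most recently processed".

```python
import bisect


class LookbackPoller:
    """Hypothetical sketch of polling with a bounded lookback buffer."""

    def __init__(self, lookback_keys=100):
        self.lookback_keys = lookback_keys  # assumed to be >= 1
        self.buffer = []                    # sorted, at most lookback_keys entries

    def poll(self, bucket_keys):
        """bucket_keys: sorted list standing in for the bucket contents."""
        # start-after is the smallest key still held in the buffer.
        start_after = self.buffer[0] if self.buffer else ""
        i = bisect.bisect_right(bucket_keys, start_after)
        listed = bucket_keys[i:]

        # Keys already in the buffer were processed on an earlier poll.
        seen = set(self.buffer)
        new_keys = [k for k in listed if k not in seen]

        for key in new_keys:
            bisect.insort(self.buffer, key)
        # Trim to the N largest keys so stored state stays bounded.
        del self.buffer[:-self.lookback_keys]
        return new_keys


poller = LookbackPoller(lookback_keys=3)
bucket = ["k01", "k02", "k04", "k05"]
first = poller.poll(bucket)                 # everything is new

bucket = sorted(bucket + ["k03", "k06"])    # "k03" arrives late, "k06" is new
second = poller.poll(bucket)                # both are picked up
third = poller.poll(bucket)                 # nothing new, nothing reprocessed
```

In this run the late key `k03` is still caught because it sorts after the smallest buffered key (`k02`). Any processed key that sorts after the buffer's smallest entry is itself in the buffer, so the window never yields duplicates.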

Benefits

This mode makes it easier and more reliable to monitor common AWS log sources:

Simpler state management: Small bounded buffer (~20 KB) instead of tracking all objects
Faster polling: Typically completes in 1-2 seconds regardless of bucket size
Lower costs: ~99% reduction in API costs for long-running log buckets
Better reliability: Consistent performance as buckets grow over time
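To make the cost claim concrete, here is a back-of-the-envelope sketch of list requests per poll. The bucket size and arrival rate are hypothetical numbers chosen for illustration, not measurements; the only fixed fact is that ListObjectsV2 returns at most 1,000 keys per request.

```python
import math

PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per request


def list_requests(objects_listed):
    # Even an empty listing costs one request.
    return max(1, math.ceil(objects_listed / PAGE_SIZE))


total_objects = 1_000_000  # hypothetical long-running log bucket
new_per_poll = 50          # hypothetical arrivals between polls
lookback = 100             # default lookback buffer size from the proposal

full_scan = list_requests(total_objects)              # list the whole prefix
incremental = list_requests(lookback + new_per_poll)  # list only the window
reduction = 1 - incremental / full_scan
```

Under these assumptions a full scan needs 1,000 list requests per poll and the incremental mode needs 1, a reduction of 99.9%. The real saving depends on bucket size, prefix layout, and polling interval.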

Limitations

Only suitable for sources with lexicographically ordered keys (timestamp-based naming)
Not suitable for custom applications with non-ordered key naming
Lookback buffer size determines how "late" an object can arrive and still be caught
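Both failure modes can be shown with a tiny stand-in for the listing call. `keys_after` is a hypothetical helper that mimics ListObjectsV2 with start-after on an in-memory list; the keys are made up.

```python
def keys_after(bucket_keys, start_after):
    # Stand-in for ListObjectsV2 with start-after: strictly greater keys, ascending.
    return sorted(k for k in bucket_keys if k > start_after)


# (a) Non-ordered naming: unpadded counters do not sort in creation order.
# "log-10" sorts before "log-9", so a checkpoint at "log-9" never sees it.
missed_unpadded = keys_after(["log-9", "log-10"], start_after="log-9")

# (b) Late arrival outside the window: with a lookback buffer of two keys,
# start-after is the smaller of the two buffered keys.
buffer = ["2024/01/01/0300Z", "2024/01/01/0400Z"]
bucket = buffer + ["2024/01/01/0100Z", "2024/01/01/0330Z"]  # two late objects
found = keys_after(bucket, start_after=min(buffer))
```

In case (a) the listing comes back empty, so `log-10` is never processed. In case (b) the late key `0330Z` falls inside the window and is listed, while `0100Z` sorts before the window and is missed. A larger lookback value widens the window at the cost of more stored state.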
