[Filebeat]: aws-s3 input - Add Lexicographical Ordering Mode
Problem
The current S3 bucket polling mode works universally but requires tracking every processed object, which becomes challenging for the most common AWS log sources (CloudTrail, VPC Flow Logs, ALB/NLB Access Logs, S3 Access Logs, CloudFront Logs). These services generate logs continuously, and as buckets accumulate objects over time, users experience:
- Growing memory usage and slower performance
- Increasing API costs as more objects accumulate
- More complex state management
Opportunity
The most common AWS log sources write to S3 with lexicographically-ordered object keys due to their timestamp-based naming conventions (e.g., YYYY/MM/DD paths or YYYYMMDDTHHmmZ timestamps). This ordering pattern can be leveraged to make monitoring these sources simpler and more efficient.
The S3 ListObjectsV2 API's start-after parameter makes incremental polling possible for these sources, because listing can resume from a known key instead of from the start of the prefix:
- List only new objects since the last poll (no need to scan entire bucket)
- Use a small bounded buffer instead of tracking all objects
- Reduce API calls dramatically (typically 1-10 per poll vs hundreds/thousands)
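As a rough illustration of the request counts above, the sketch below estimates how many ListObjectsV2 requests one poll needs. It relies on the ListObjectsV2 limit of 1,000 keys per response; the bucket size, the number of new objects, and the function name are hypothetical figures chosen for the example, not measurements.

```python
import math

PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per request


def list_requests(keys_to_list, page_size=PAGE_SIZE):
    """Number of ListObjectsV2 requests needed to page through keys_to_list keys."""
    # Even an empty listing costs one request.
    return max(1, math.ceil(keys_to_list / page_size))


# Hypothetical bucket: 2,000,000 objects under the prefix, 500 new since the
# last poll, plus up to 100 lookback keys that are listed again.
full_scan = list_requests(2_000_000)
incremental = list_requests(500 + 100)
print(full_scan, incremental)  # 2000 1
```

With these assumed numbers, a full scan pages through the whole prefix on every poll, while the start-after listing fits in a single request.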
Proposed Solution
Configuration
filebeat.inputs:
  - type: aws-s3
    bucket_arn: arn:aws:s3:::my-cloudtrail-bucket
    bucket_list_prefix: AWSLogs/123456789012/CloudTrail/
    lexicographical_ordering: true
    lexicographical_lookback_keys: 100 # Required: buffer size for out-of-order protection
Technical Requirements
- Lookback Buffer (Mandatory): Maintain a bounded buffer of the N most recently processed keys (default: 100). The start-after parameter uses the lexicographically oldest key in the buffer, creating a sliding window that catches late-arriving objects.
- State Storage: Store only the lookback buffer (~20 KB for default 100 keys) instead of all processed objects, providing O(1) bounded memory regardless of total objects processed.
- Out-of-Order Handling: The lookback buffer ensures objects that arrive out of lexicographical order (due to late delivery, clock skew, or retries) are still detected and processed.
- API Usage: Use ListObjectsV2 with the start-after parameter set to the oldest key in the lookback buffer. The AWS SDK's paginator will automatically handle continuation-token for pagination when there are more than 1,000 new objects.
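The requirements above could be prototyped roughly as follows. This is a minimal sketch in Python, not the Filebeat implementation: LookbackBuffer, list_objects_after, and poll are hypothetical names, and the listing function is an in-memory stand-in for ListObjectsV2 with start-after. One assumption worth calling out: the buffer here keeps the N lexicographically largest processed keys and evicts the smallest, because evicting by insertion order could drop a key that is still inside the listing window and cause it to be processed twice.

```python
import bisect


class LookbackBuffer:
    """Bounded, sorted set of processed keys (hypothetical helper)."""

    def __init__(self, size=100):
        self.size = size
        self._keys = []  # kept sorted in ascending order

    def start_after(self):
        # Smallest key in the buffer; "" means "list from the beginning".
        return self._keys[0] if self._keys else ""

    def seen(self, key):
        i = bisect.bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key

    def add(self, key):
        if self.seen(key):
            return
        bisect.insort(self._keys, key)
        if len(self._keys) > self.size:
            self._keys.pop(0)  # evict the lexicographically smallest key


def list_objects_after(bucket_keys, start_after):
    """In-memory stand-in for ListObjectsV2 with start-after:
    returns keys strictly greater than start_after, in ascending order."""
    return sorted(k for k in bucket_keys if k > start_after)


def poll(bucket_keys, buf):
    """Return keys not yet processed and record them in the buffer."""
    new_keys = []
    for key in list_objects_after(bucket_keys, buf.start_after()):
        if not buf.seen(key):
            new_keys.append(key)
            buf.add(key)
    return new_keys


buf = LookbackBuffer(size=3)
bucket = {"2024/01/01/a", "2024/01/01/b", "2024/01/02/a"}
print(poll(bucket, buf))  # all three keys are new on the first poll

# A new object and a late-arriving, out-of-order object appear.
bucket |= {"2024/01/02/b", "2024/01/01/c"}
print(poll(bucket, buf))  # ['2024/01/01/c', '2024/01/02/b']
```

The late key 2024/01/01/c is picked up because it still sorts after the smallest key in the buffer, and keys already in the buffer are skipped rather than reprocessed.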
Benefits
This mode makes it easier and more reliable to monitor common AWS log sources:
- Simpler state management: Small bounded buffer (~20 KB) instead of tracking all objects
- Faster polling: Typically completes in 1-2 seconds regardless of bucket size
- Lower costs: ~99% reduction in API costs for long-running log buckets
- Better reliability: Consistent performance as buckets grow over time
Limitations
- Only suitable for sources with lexicographically-ordered keys (timestamp-based naming)
- Not suitable for custom applications with non-ordered key naming
- Lookback buffer size determines how "late" an object can arrive and still be caught
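To make the last limitation concrete, here is a small sketch of which late-arriving keys remain visible. It assumes only that a start-after listing returns keys strictly greater than the given key; the buffer contents and the is_listed helper are hypothetical.

```python
def is_listed(key, start_after):
    """A start-after listing only returns keys strictly greater than start_after."""
    return key > start_after


# Hypothetical lookback buffer contents after a few polls.
buffer_keys = ["2024/01/03/a", "2024/01/03/b", "2024/01/04/a"]
start_after = min(buffer_keys)

print(is_listed("2024/01/03/c", start_after))  # True: still inside the window
print(is_listed("2024/01/02/z", start_after))  # False: sorts before the window
```

A larger lexicographical_lookback_keys value moves the smallest buffered key further back, which widens the window at the cost of listing more already-processed keys on each poll.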
Related
- https://gist.github.com/andrewkroh/5e8f5f3eca6efcd5b133e74aa8417180
Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)
@andrewkroh, are we going to enable AWS S3 in agentless (whenever lexicographical_ordering: true) since we can now have simpler state management?