Improvement: Keep Track of S3 Processing
Background
When StreamAlert processes files from S3, the number of parsed records can be very high (over 10k). In these cases, the rule processor can time out during processing, commonly because Firehose backs off too aggressively while sending record batches, which results in an Invocation Failure. Lambda then retries the same request, sending the same batch of records back out repeatedly.
Steps to Reproduce
Configure the rule_processor to accept s3_events as input with very large files, and also have Firehose enabled.
Desired Change
The goal is to avoid processing duplicate records from S3. This could be accomplished with something like the following (see the sketch after this list):
- Create a DDB table, with the S3 object name as the primary key
- Track how many alerts were sent, and whether alert processing completed
- Track how many records were sent to Firehose
- When reprocessing a previously seen S3 object, load its saved state and resume where processing left off
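
A minimal sketch of what this checkpointing could look like, assuming a hypothetical DynamoDB table named `s3_processing_state` keyed on the S3 object name; the attribute names, table name, and `send_to_firehose` placeholder are all illustrative and not part of StreamAlert:

```python
import boto3

TABLE_NAME = 's3_processing_state'  # hypothetical table name

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(TABLE_NAME)


def load_state(object_key):
    """Return saved progress for this S3 object, or a fresh record."""
    response = table.get_item(Key={'object_key': object_key})
    return response.get('Item', {
        'object_key': object_key,
        'records_sent': 0,   # records already delivered to Firehose
        'alerts_sent': 0,    # alerts already dispatched
        'completed': False,  # True once the object was fully processed
    })


def save_state(state):
    """Persist progress so a retried invocation can resume."""
    table.put_item(Item=state)


def send_to_firehose(record):
    """Placeholder for the real Firehose delivery logic."""
    pass


def process_s3_object(object_key, records):
    """Process records from an S3 object, skipping work already done."""
    state = load_state(object_key)
    if state['completed']:
        return  # object already fully processed; skip duplicate work

    for index, record in enumerate(records):
        # Skip records that a previous (timed-out) invocation already sent
        if index < state['records_sent']:
            continue
        send_to_firehose(record)
        state['records_sent'] = index + 1
        save_state(state)  # checkpoint so a retry can resume here

    state['completed'] = True
    save_state(state)
```

Checkpointing after every record is only for illustration; in practice the state would likely be saved per Firehose batch to keep DynamoDB writes reasonable.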