Duplicate data collection when using the S3 input plugin with multiple Logstash nodes
My Logstash version is 7.17.7. When I deploy multiple Logstash nodes that collect data from the same OBS buckets at the same time, duplicate data is sometimes collected. Is there any way to solve this problem?
My configuration is like this:

```
input {
  s3 {
    access_key_id => "xxx"
    codec => "plain"
    secret_access_key => "xxx"
    region => "xxx"
    bucket => "obs-test"
    prefix => "test/log"
    interval => 3
    delete => true
    endpoint => "xxx"
    temporary_directory => "xxx"
  }
}

output {
  file {
    path => ["xxx"]
  }
}
```
I used `delete => true`, but in a multi-node scenario the file might not have been deleted yet and could be read simultaneously by other nodes.
Can the S3 plugin only be deployed on a single node?
Howdy! This is a classic problem with ingesting from object storage.
There is no coordination between Logstash nodes, so between the time a node grabs an object and the time it processes and deletes it, there is a window where another node could grab the same object.
The classic solutions to this problem are to:
- Have each node process only a designated path (prefix) within the bucket (see the sketch after this list)
- Use the community-supported S3-SNS-SQS plugin. With this you would set up your AWS account to push a message to an SQS queue each time an object is written to S3. Logstash then subscribes to the SQS queue and gets "notified" when there is a file available to process. This ensures that only one node is engaged for each file that needs to be processed (see the rough sketch at the end of this post).
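For the first option, here is a minimal sketch using only the options already in your config. The prefixes `test/log/node1` and `test/log/node2` are hypothetical names, so substitute your own layout. Each node runs its own pipeline pointing at its own prefix:

```
# Node 1 reads only objects under its own prefix;
# node 2 would use "test/log/node2", and so on.
input {
  s3 {
    access_key_id => "xxx"
    secret_access_key => "xxx"
    region => "xxx"
    endpoint => "xxx"
    bucket => "obs-test"
    prefix => "test/log/node1"
    interval => 3
    delete => true
  }
}
```

This avoids the race entirely because no two nodes ever list the same objects, but it only helps if whatever writes to the bucket can distribute objects across those per-node prefixes.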
While SQS-S3 is not an Elastic-supported Logstash plugin, using SQS-S3 is supported with Elastic Agent. We've worked really hard to make this a great experience with Elastic Agent -- I wrote a blog about the performance improvements in this area, which you can check out here: https://www.elastic.co/blog/elastic-agent-amazon-s3-sqs. Once gathered with Elastic Agent, you can of course always send the data on to Logstash.
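If you do stay on Logstash for the SQS route, the input would look roughly like the sketch below. The plugin name `s3snssqs` and its option names come from the community logstash-input-s3-sns-sqs plugin's documentation and can change between versions, so treat the exact names as assumptions and check the README of the version you install:

```
# Sketch only: plugin and option names assumed from the community
# logstash-input-s3-sns-sqs plugin; verify against your installed version.
input {
  s3snssqs {
    region => "xxx"                       # region of the SQS queue
    queue  => "obs-event-notifications"   # hypothetical queue receiving object-created notifications
  }
}
```

The reason this removes the duplicates is that SQS hides a message from other consumers while one node is processing it, so each object notification is handled by exactly one Logstash node no matter how many nodes subscribe to the queue.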