
Duplicate data collection occurs when using the S3 input plugin

gaiyin opened this issue 5 months ago • 2 comments

My Logstash version is 7.17.7. When multiple Logstash nodes collect data from the same OBS bucket simultaneously, duplicate data is occasionally collected. Is there any way to solve this problem?

My configuration looks like this:

```
input {
  s3 {
    access_key_id       => "xxx"
    codec               => "plain"
    secret_access_key   => "xxx"
    region              => "xxx"
    bucket              => "obs-test"
    prefix              => "test/log"
    interval            => 3
    delete              => true
    endpoint            => "xxx"
    temporary_directory => "xxx"
  }
}

output {
  file {
    path => ["xxx"]
  }
}
```

I set `delete => true`, but in a multi-node scenario a file may be read by other nodes before the node that first picked it up has deleted it.

gaiyin commented Jul 18 '25 02:07

Does this mean the S3 input plugin can only be deployed on a single node?

gaiyin commented Jul 18 '25 04:07

Howdy! This is a classic problem with ingesting from object storage.

There is no coordination between Logstash nodes, so between the time a node grabs an object and the time it finishes processing and deletes it, there is a window during which another node can grab the same object.

The classic solutions to this problem are to:

  1. Have each node process only a designated path (prefix) within the bucket (see the first sketch after this list)
  2. Use the community-supported S3-SNS-SQS plugin. With this, you set up your AWS account to push a message to an SQS queue each time an object is written to S3. Logstash then subscribes to the SQS queue and is "notified" when there is a file available to process. This ensures that only one node picks up each file that needs to be processed (see the second sketch after this list)
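
For option 1, a minimal sketch: give each node its own `prefix` so no two nodes ever see the same object. The `node-a`/`node-b` prefixes here are hypothetical; whatever writes the objects would need to partition them the same way:

```
# Pipeline on node A: reads only objects under its own prefix
input {
  s3 {
    access_key_id     => "xxx"
    secret_access_key => "xxx"
    region            => "xxx"
    endpoint          => "xxx"
    bucket            => "obs-test"
    prefix            => "test/log/node-a/"  # node B would use "test/log/node-b/"
    delete            => true
  }
}
```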
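
For option 2, a rough sketch of the Logstash side, assuming the community logstash-input-s3-sns-sqs plugin; the `s3snssqs` input name and the `queue`/`region` options are from my recollection of that plugin's README, so verify them against the docs for your version. The queue name and region below are hypothetical:

```
# Each S3 "object created" event becomes one SQS message. SQS's visibility
# timeout hides an in-flight message from other consumers, so only one
# Logstash node downloads and processes each object.
input {
  s3snssqs {
    region => "us-east-1"                  # region of the SQS queue
    queue  => "logstash-s3-notifications"  # queue receiving the S3 event notifications
  }
}
```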

While SQS-S3 is not an Elastic-supported Logstash plugin, the SQS-S3 pattern is supported with Elastic Agent. We've worked really hard to make this a great experience with Elastic Agent; I wrote a blog post about the performance improvements in this area, which you can check out here: https://www.elastic.co/blog/elastic-agent-amazon-s3-sqs. Once gathered with Elastic Agent, you can of course still send the data to Logstash.

strawgate commented Jul 19 '25 13:07