kafka-connect-hdfs

Duplicate File exists in HDFS Path

Open NarasimhaKattunga opened this issue 5 years ago • 2 comments

The HDFS sink is creating duplicate files (same file name and content) in different time zones.

[Screenshots: Capture, Capture1]

NarasimhaKattunga, Nov 18 '19 14:11

@NarasimhaKattunga could you elaborate? The two screenshots look like they show different paths. What is your topics.dir for the HDFS connector? I would think one of them is written by the HDFS connector (the one that topics.dir points to), and perhaps you have some other job that is replicating it to a different path in the same HDFS cluster?

ncliang, Dec 12 '19 19:12

We have an ETL job that processes the data files in topics.dir into another layer. During this process, files are moved out of topics.dir to the /processed directory.

What we have observed is that the same file name reappears in topics.dir after some time, even though the file has already been processed by the ETL job and moved out to the /processed directory.

NarasimhaKattunga, Oct 09 '20 09:10
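
One way to confirm the behaviour described above is to compare listings of the two directories and report file names that exist in both. The following is only a minimal sketch: it assumes the listings were captured as text from `hdfs dfs -ls -R` (eight whitespace-separated columns, path last, no spaces in paths), and the function names `parse_ls_paths` and `reappeared_files` are hypothetical helpers, not part of the connector.

```python
import posixpath


def parse_ls_paths(ls_output):
    """Return file paths from `hdfs dfs -ls -R` style text (path is the last column)."""
    paths = []
    for line in ls_output.splitlines():
        parts = line.split()
        # Skip blank lines, "Found N items" headers, and directory entries.
        if len(parts) < 8 or parts[0].startswith("d"):
            continue
        paths.append(parts[-1])
    return paths


def reappeared_files(topics_listing, processed_listing):
    """Return paths under topics.dir whose file name also exists in the processed listing."""
    processed_names = {
        posixpath.basename(p) for p in parse_ls_paths(processed_listing)
    }
    return sorted(
        p
        for p in parse_ls_paths(topics_listing)
        if posixpath.basename(p) in processed_names
    )
```

Running this periodically against fresh listings of topics.dir and /processed would show which file names come back after the ETL job has moved them, and when.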