kafka-connect-hdfs
Duplicate File exists in HDFS Path
The HDFS Sink is creating duplicate files (same file name and content) in different time zones.
@NarasimhaKattunga could you elaborate more? The 2 screenshots look like they are in different paths. What is your topics.dir for the hdfs connector? I would think one of them is written by the hdfs connector (the one that topics.dir points to), and perhaps you have some other job that is replicating it to a different path in the same hdfs cluster?
We have an ETL job that processes the data files under topics.dir into another layer. During this process, the files are moved out of topics.dir into the /processed directory.
What we have observed is that the same file name appears in topics.dir again after some time, even though that file was already processed by the ETL job and moved out to the /processed dir.
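To pin down which files come back, it may help to compare the file names currently under topics.dir with the ones already moved to /processed. Below is a minimal sketch of that check, assuming the two directory listings have already been collected as plain lists of paths (for example from the output of hdfs dfs -ls -R). The function name and the example paths are hypothetical and are not part of the connector.

```python
from pathlib import PurePosixPath


def reappeared_files(topics_dir_listing, processed_listing):
    """Return paths under topics.dir whose file name already exists in /processed.

    Both arguments are lists of HDFS paths given as strings. Only the final
    path component (the file name) is compared, because the ETL job moves
    files to a different directory.
    """
    processed_names = {PurePosixPath(p).name for p in processed_listing}
    return sorted(
        p for p in topics_dir_listing
        if PurePosixPath(p).name in processed_names
    )


# Hypothetical listings, for illustration only.
topics_listing = [
    "/topics/orders/partition=0/orders+0+0000000000+0000000099.avro",
    "/topics/orders/partition=0/orders+0+0000000100+0000000199.avro",
]
processed_listing = [
    "/processed/orders/orders+0+0000000000+0000000099.avro",
]

duplicates = reappeared_files(topics_listing, processed_listing)
```

Running this periodically and noting when a name first shows up in the result, together with the connector logs around that time, should show whether the files reappear after a connector restart or rebalance.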