kafka-connect-hdfs

Custom output directory naming (without topic in it)

Open JozoVilcek opened this issue 4 years ago • 5 comments

Hello, is it possible / safe to provide a custom directory naming convention for cases where one does not want to use the actual topic name?

I see it is possible via a custom partitioner, where the topic is passed in and can simply be ignored: https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/partitioner/Partitioner.java#L38
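A minimal sketch of that idea, not the connector's actual code: PathPartitioner below is a hypothetical stand-in that mirrors only the path-generating method of the Partitioner interface linked above, and FixedDirectoryPartitioner and the directory name "events" are made up for illustration.

```java
import java.util.Objects;

// Sketch only. PathPartitioner is a hypothetical, reduced stand-in for the
// connector's Partitioner interface; it is NOT the real interface.
class SharedDirectoryExample {

    interface PathPartitioner {
        String generatePartitionedPath(String topic, String encodedPartition);
    }

    /** Ignores the topic and places every record under one fixed directory. */
    static final class FixedDirectoryPartitioner implements PathPartitioner {
        private final String directory;

        FixedDirectoryPartitioner(String directory) {
            this.directory = Objects.requireNonNull(directory, "directory");
        }

        @Override
        public String generatePartitionedPath(String topic, String encodedPartition) {
            // The topic argument is deliberately unused.
            return directory + "/" + encodedPartition;
        }
    }

    public static void main(String[] args) {
        PathPartitioner partitioner = new FixedDirectoryPartitioner("events");
        // Two differently named topics end up under the same directory.
        System.out.println(partitioner.generatePartitionedPath("dc1.events", "year=2021/month=02"));
        System.out.println(partitioner.generatePartitionedPath("dc2.events", "year=2021/month=02"));
    }
}
```

This only covers where files are written. It says nothing about how the connector later finds those files again, which is the concern raised next.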

However, it feels like this is not applied consistently; e.g. offset recovery seems to look specifically for the directory named after the topic: https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L602

The question is whether I can safely (somehow) avoid using the topic name (and possibly also have a custom Hive table name), or whether this is intentionally not possible for some reason.

JozoVilcek, Feb 03 '21 15:02

I think my case is the same as https://github.com/confluentinc/kafka-connect-hdfs/issues/515, and the customisation I am looking for will break recovery.

JozoVilcek, Feb 04 '21 12:02

I wonder about this line: https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L602. The directory used to look up the max offset is FileUtils.topicDirectory(url, topicsDir, tp.topic()). Would it work to just use FileUtils.topicDirectory(url, topicsDir) instead, since the filter looks at files and checks that each file name contains the correct topic name?
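To illustrate why a file-name filter could be enough, here is a rough sketch that recovers the highest committed offset for one topic partition from file names alone, regardless of which directory the files sit in. It assumes committed files are named <topic>+<partition>+<startOffset>+<endOffset>.<extension>. That layout, the class OffsetRecoverySketch and the method maxCommittedOffset are assumptions made for this example, not the connector's code.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the file-name layout below is an assumption for illustration.
class OffsetRecoverySketch {

    // Assumed layout: <topic>+<partition>+<startOffset>+<endOffset>.<extension>
    private static final Pattern COMMITTED_FILE =
        Pattern.compile("([^+]+)\\+(\\d+)\\+(\\d+)\\+(\\d+)\\.\\w+");

    /** Returns the largest end offset among files that belong to the given topic partition. */
    static OptionalLong maxCommittedOffset(List<String> fileNames, String topic, int partition) {
        return fileNames.stream()
            .map(COMMITTED_FILE::matcher)
            .filter(Matcher::matches) // skips names that do not follow the layout
            .filter(m -> m.group(1).equals(topic)
                && Integer.parseInt(m.group(2)) == partition)
            .mapToLong(m -> Long.parseLong(m.group(4)))
            .max();
    }

    public static void main(String[] args) {
        // Files from two topics sharing one directory.
        List<String> shared = List.of(
            "dc1.events+0+0000000000+0000000099.avro",
            "dc1.events+0+0000000100+0000000199.avro",
            "dc2.events+0+0000000000+0000000500.avro",
            "_SUCCESS");
        System.out.println(maxCommittedOffset(shared, "dc1.events", 0));
    }
}
```

Because the topic and partition are matched from the file name, files of other topics in the same directory do not affect the result. Whether the real recovery path can rely on this is exactly what the question above asks.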

JozoVilcek, Feb 04 '21 16:02

Would it be possible to get such a feature? Would it be accepted if it were contributed?

A bit more about my use case: I have a master Kafka cluster that receives mirrored traffic from N remote clusters. Each mirror writes to a topic with a prefixed name. I want to dump the N topics under a single directory rather than have N HDFS directories and make each HDFS consumer deal with that.

JozoVilcek, Feb 24 '21 08:02

@dosvath, @avocader would you be able to give some advice here, or recommend where such a question should be discussed?

JozoVilcek, Feb 25 '21 08:02

I have created an MR which should address this feature request. https://github.com/confluentinc/kafka-connect-hdfs/pull/553

Can someone please review and let me know if this is acceptable?

JozoVilcek, Mar 17 '21 12:03