kafka-connect-hdfs
Custom output directory naming (without topic in it)
Hello, is it possible and safe to provide a custom directory naming convention when one does not want to use the actual topic name?
I see that it is possible via a custom partitioner, where the topic is passed in and can simply be left unused: https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/partitioner/Partitioner.java#L38
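To illustrate what I mean, here is a minimal standalone sketch. It deliberately does not implement the real `Partitioner` interface (so it compiles without the connector on the classpath); it only mirrors the shape of a `generatePartitionedPath(topic, encodedPartition)` method. The class name `FixedDirectoryPathSketch` and the `sharedDirectory` value are hypothetical.

```java
import java.util.Objects;

// Hypothetical sketch: NOT the real io.confluent.connect.hdfs.partitioner.Partitioner.
// It only shows a path generator that receives the topic but does not use it.
final class FixedDirectoryPathSketch {
    // Hypothetical setting: the directory all topics should be written under.
    private final String sharedDirectory;

    FixedDirectoryPathSketch(String sharedDirectory) {
        this.sharedDirectory = Objects.requireNonNull(sharedDirectory);
    }

    // The topic argument is accepted for signature compatibility and then ignored.
    String generatePartitionedPath(String topic, String encodedPartition) {
        return sharedDirectory + "/" + encodedPartition;
    }
}
```

With this, two different topics end up with the same relative path for the same encoded partition, which is exactly the layout that offset recovery would then need to cope with.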
However, it feels like this is not applied consistently; e.g. offset recovery seems to look specifically for the directory named after the topic: https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L602
The question is whether I can safely (somehow) avoid using the topic name (and possibly also have a custom Hive table name), or whether this is intentionally not possible for some reason.
I think my case is the same as https://github.com/confluentinc/kafka-connect-hdfs/issues/515, and the customisation I am looking for will break recovery.
I wonder, here: https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L602
The directory used for looking up the max offset is FileUtils.topicDirectory(url, topicsDir, tp.topic())
Would it work to just use FileUtils.topicDirectory(url, topicsDir) instead, since the filter looks at files and checks that the file name contains the correct topic name?
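A rough sketch of why filtering by file name could be enough when several topics share one directory. This is not the connector's code: it works on plain file names, and it assumes committed files are named `<topic>+<partition>+<startOffset>+<endOffset><extension>`. The class and method names are hypothetical. Note that it compares the topic exactly rather than with "contains", so that a topic whose name is a substring of another topic's name is not matched by mistake.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: scans file names from a shared directory and finds the
// highest committed end offset for one topic partition.
final class SharedDirOffsetScanSketch {
    // Assumed layout: <topic>+<partition>+<startOffset>+<endOffset><extension>
    private static final Pattern COMMITTED =
        Pattern.compile("([a-zA-Z0-9._-]+)\\+(\\d+)\\+(\\d+)\\+(\\d+)(\\.\\w+)?");

    // Returns the largest end offset found, or -1 if no file matches.
    static long maxEndOffset(List<String> fileNames, String topic, int partition) {
        long max = -1L;
        for (String name : fileNames) {
            Matcher m = COMMITTED.matcher(name);
            if (!m.matches()) {
                continue; // not a committed data file, e.g. a temp or marker file
            }
            if (!m.group(1).equals(topic)) {
                continue; // file belongs to another topic in the same directory
            }
            if (Integer.parseInt(m.group(2)) != partition) {
                continue; // same topic, different partition
            }
            max = Math.max(max, Long.parseLong(m.group(4)));
        }
        return max;
    }
}
```

As long as the topic and partition stay in the file name, files from different topics in the same directory can still be told apart during recovery, at least under the naming assumption above.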
Would it be possible to get such a feature? Would it be accepted if it is contributed?
A bit more about my use case: I have a master Kafka cluster that receives mirror traffic from N remote clusters. Each mirror writes to a topic with a prefixed name. I want to dump the N topics under a single directory rather than have N HDFS directories and make each HDFS consumer deal with that.
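For concreteness, the mapping I have in mind looks roughly like the sketch below. The prefix format is an assumption for illustration only (a cluster name followed by a separator such as `.`), and the class and method names are hypothetical.

```java
// Hypothetical sketch: derive a shared directory name by dropping the mirror prefix.
final class MirrorPrefixSketch {
    // Strips everything up to and including the first separator.
    // Topics without the separator are returned unchanged.
    static String sharedDirectoryFor(String topic, String separator) {
        int idx = topic.indexOf(separator);
        return idx < 0 ? topic : topic.substring(idx + separator.length());
    }
}
```

Under that assumption, mirrored topics such as `dc1.events` and `dc2.events` would both map to the directory `events`.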
@dosvath, @avocader would you be able to give some advice here or recommend where such a question should be discussed?
I have created an MR which should address this feature request. https://github.com/confluentinc/kafka-connect-hdfs/pull/553
Can someone please review and let me know if this is acceptable?