kafka-connect-hdfs
Feature request: Rotation based on maximum file size on HDFS.
I'd like the option to specify the maximum file size the HDFS connector writes before rotating. As I understand it, the only way to do this today is to approximate it by setting flush.size (based on the number of records) or a time interval. The motivation is that it's very useful to keep files at approximately the HDFS block size, but no larger. This gives us more fine-grained control and the assurance that we won't end up with files that are too large, or worse, many small files. I'd like to add this feature to the codebase myself if possible. Are there any restrictions or guidelines I have to take into account if I want the possibility of merging these changes back into the codebase? Or was this feature previously explored and abandoned for some reason?
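For reference, a minimal sketch of the approximation described above, using the connector's existing flush.size and rotate.interval.ms properties. The topic name, HDFS URL, and the assumed ~1 KB average record size are illustrative, not from this thread:

```java
// Sketch: approximating file size by record count (flush.size) and
// wall-clock time (rotate.interval.ms), since no byte-based limit exists yet.
import java.util.HashMap;
import java.util.Map;

public class SizeApproximationConfig {
    public static Map<String, String> connectorProps() {
        Map<String, String> props = new HashMap<>();
        props.put("connector.class", "io.confluent.connect.hdfs.HdfsSinkConnector");
        props.put("topics", "my-topic");                 // hypothetical topic
        props.put("hdfs.url", "hdfs://namenode:8020");   // hypothetical cluster
        // ~128 MB HDFS block / ~1 KB average record => flush after ~131072 records
        props.put("flush.size", "131072");
        // Also rotate at least every 10 minutes so slow partitions still commit
        props.put("rotate.interval.ms", "600000");
        return props;
    }
}
```

The weakness, as noted above, is that this only approximates the target file size: if the average record size drifts, the files drift with it.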
I think this would be a useful change, if handled appropriately. For example, a schema could change in the middle of a file (in any format, not necessarily Avro or Parquet). For text or JSON that might not be a big deal, but it would be annoying if, say, a process expected to parse 8 CSV columns and only got 7.
Plus, time-based partitioning should not wait for the size limit to be reached before writing the data into the "truncated" datetime path for a given partitioner.
The "workarounds" are all downstream processors that I've enumerated in #271
@TomLous this feature would indeed be useful. As far as I know, it hasn't been rejected previously. We'd need an elegant way to track the bytes exported and to detect when the limit is about to be reached (or has just been exceeded).
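To make "track bytes exported" concrete, here is a minimal sketch of one possible shape for such a tracker; the class and method names are hypothetical, not existing connector API:

```java
// Sketch: accumulate the serialized size of each record written to the
// current temp file and signal when the byte budget is exhausted.
public class ByteBudget {
    private final long maxBytes;   // e.g. the HDFS block size
    private long written;          // bytes written to the current temp file

    public ByteBudget(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Record the serialized size of one record and report whether the
     * file should rotate now that this record has been counted.
     */
    public boolean addAndCheck(long recordBytes) {
        written += recordBytes;
        return written >= maxBytes;
    }

    /** Called after rotation, when a fresh temp file is opened. */
    public void reset() {
        written = 0;
    }
}
```

Counting serialized bytes on the producer side avoids round-trips to the NameNode to probe file sizes, at the cost of being approximate for formats with block compression.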
@cricket007 size-based partitioning should not split records and should respect the other partitioning criteria.
Thanks for the feedback.
I've already created an implementation that respects other filter criteria for my current client. I'm waiting for their legal team to get back to me, so I can create a PR.
Any progress on this feature?
Sorry, not from my end. The OK was never given at eBay, and I'm no longer working there. Unfortunately I'm also not allowed/able to share the progress we made on this feature. I can say that it's pretty hard without a major rewrite, because the logic in TopicPartitionWriter is based on, guess what, partitions rather than individual files. So the best we could implement was: if one file reaches the max size, rotate all of them. It's not pretty, and in the end we moved away from Kafka Connect to Flink to get more fine-grained control over the HDFS files.
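For readers unfamiliar with that limitation, here is a minimal sketch of the coarse-grained "one file over the limit rotates everything" behavior described above. All names are hypothetical; this is not the actual TopicPartitionWriter code:

```java
// Sketch: track bytes per open temp file within one partition writer; as soon
// as any single file crosses the limit, every open file is rotated together.
import java.util.HashMap;
import java.util.Map;

public class CoarseRotationTracker {
    private final long maxBytes;
    private final Map<String, Long> bytesPerFile = new HashMap<>();

    public CoarseRotationTracker(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns true if writing these bytes means ALL open files must rotate. */
    public boolean record(String encodedPartition, long recordBytes) {
        long total = bytesPerFile.merge(encodedPartition, recordBytes, Long::sum);
        return total >= maxBytes; // one file over the limit => rotate everything
    }

    public void rotateAll() {
        bytesPerFile.clear(); // commit + reopen is handled elsewhere
    }
}
```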
Any progress on this feature guys?
Here is my attempt: https://github.com/confluentinc/kafka-connect-hdfs/compare/5.5.x...vipinkumar7:5.5.x?expand=1
This is what I implemented for sequence files: write to the file in append mode until the file size limit is reached, and only then do the final commit; until then it keeps writing to the same temp file.
Let me know what your input is and I will refactor it.
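For readers skimming the branch, here is a minimal sketch of that append-until-limit idea using the standard Hadoop FileSystem API. The paths and the commit-by-rename step are simplifications of what the branch does, and real code would also need failure handling and exactly-once guarantees:

```java
// Sketch: keep appending batches to one temp file, and only commit (rename)
// once its size reaches the limit. Opening/closing the stream per batch is
// simplistic; a real writer would keep the stream open. Note that HDFS
// append support must be enabled on the cluster.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendingWriter {
    private final FileSystem fs;
    private final Path tempFile;   // e.g. a file under the connector's +tmp dir
    private final Path finalFile;  // committed location
    private final long maxBytes;

    public AppendingWriter(FileSystem fs, Path tempFile, Path finalFile, long maxBytes) {
        this.fs = fs;
        this.tempFile = tempFile;
        this.finalFile = finalFile;
        this.maxBytes = maxBytes;
    }

    /** Append one serialized batch; commit the file once it is big enough. */
    public void write(byte[] batch) throws IOException {
        try (FSDataOutputStream out =
                 fs.exists(tempFile) ? fs.append(tempFile) : fs.create(tempFile)) {
            out.write(batch);
        }
        if (fs.getFileStatus(tempFile).getLen() >= maxBytes) {
            fs.rename(tempFile, finalFile); // final commit; check return value in real code
        }
    }
}
```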
#671 implements file-size-based rotation without any probing of file sizes on HDFS.