
Support for stream processing variable length records

Open chandrasekaravr opened this issue 6 years ago • 3 comments

@yruslan I have a large set of input files to process and wanted to take the stream processing approach. But when I tried the same options that I have used successfully for variable-length records, i.e. .option("generate_record_id", true) .option("is_record_sequence", "true") .option("is_rdw_big_endian", "true") .option("variable_size_occurs", "true"), the files were processed as fixed-length records.
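For reference, this is roughly how the batch read looks with those options; the copybook and input paths below are placeholders, not from the original report:

```scala
// A minimal sketch of a Cobrix batch read for variable-length records,
// assuming hypothetical copybook and data paths.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cobrix-vlr-batch").getOrCreate()

val df = spark.read
  .format("cobol")
  .option("copybook", "s3://bucket/copybooks/record.cpy") // hypothetical path
  .option("generate_record_id", "true")
  .option("is_record_sequence", "true")
  .option("is_rdw_big_endian", "true")
  .option("variable_size_occurs", "true")
  .load("s3://bucket/data/input_file.dat")                // hypothetical path
```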

Any thoughts or suggestions?

chandrasekaravr avatar Oct 09 '19 13:10 chandrasekaravr

Yes, unfortunately streaming variable-length record files is not supported yet. What is your use case? Do you want to stream from a Kafka topic?

yruslan avatar Oct 10 '19 14:10 yruslan

@yruslan Thanks for the information, Ruslan. The use case here is that we have a few hundred EBCDIC files with variable-length records, each ~30 GB, that need to be processed from S3 on EMR. The idea is to copy them from S3 to EMR HDFS in parallel and have stream-based processing pick them up to perform the ETL.

chandrasekaravr avatar Oct 10 '19 15:10 chandrasekaravr

Interesting. So your use case is that you want to process a large set of files in a streaming fashion so that files that are already transformed become available as soon as possible, right? If so, the best way, for now, is to process each file in a separate batch job and have some sort of scheduling mechanism to trigger the next job once the previous one has finished.
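A rough sketch of that workaround, one self-contained batch job per file, triggered sequentially. The directory layout, copybook location, and output format are assumptions for illustration, not part of Cobrix itself:

```scala
// Process each EBCDIC file as its own batch job; a scheduler (Airflow, Oozie,
// or even a simple loop like this one) triggers the next job when the
// previous one finishes.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cobrix-per-file-etl").getOrCreate()

val inputDir  = new Path("hdfs:///landing/ebcdic")        // hypothetical landing area
val outputDir = "hdfs:///curated/parquet"                 // hypothetical output area
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

fs.listStatus(inputDir).map(_.getPath).foreach { file =>
  val df = spark.read
    .format("cobol")
    .option("copybook", "hdfs:///copybooks/record.cpy")   // hypothetical copybook
    .option("generate_record_id", "true")
    .option("is_record_sequence", "true")
    .option("is_rdw_big_endian", "true")
    .option("variable_size_occurs", "true")
    .load(file.toString)

  // Write each transformed file to its own output directory so results
  // become available as soon as that file's job completes.
  df.write.mode("overwrite").parquet(s"$outputDir/${file.getName}")
}
```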

yruslan avatar Oct 11 '19 07:10 yruslan