
S3 partition file per hourly batch

Open panda87 opened this issue 7 years ago • 1 comments

Hi

I'd like to know whether there is an option to write one file per partition, that is, one file per hour. For example, if I have 5 workers with 5 tasks and I run an hourly batch, would this plugin know to aggregate the data into one file per batch run?

Thanks D.

panda87 avatar Jul 26 '17 15:07 panda87

You could use the TimeBasedPartitioner and a rotation interval configured for an hour.

However, this is not recommended for high-volume topics, because the connector then needs to hold an hour's worth of data.
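
To make the trade-off concrete, here is a minimal Python sketch of hourly time-based bucketing. It is not the connector's code: the function names `hourly_partition` and `group_by_hour` and the `year=/month=/day=/hour=` path layout are assumptions chosen for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_partition(ts_ms: int) -> str:
    """Map an epoch-millisecond timestamp to an hourly partition path (UTC).

    The path layout is a hypothetical example, not the connector's default.
    """
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return dt.strftime("year=%Y/month=%m/day=%d/hour=%H")


def group_by_hour(records):
    """Group (timestamp_ms, value) pairs by their hourly partition.

    Every value for an hour stays in its bucket until that hour is
    written out, so each bucket grows with the topic's volume.
    """
    buckets = defaultdict(list)
    for ts_ms, value in records:
        buckets[hourly_partition(ts_ms)].append(value)
    return dict(buckets)


if __name__ == "__main__":
    sample = [(0, "a"), (1_000, "b"), (3_600_000, "c")]
    for path, values in group_by_hour(sample).items():
        print(path, len(values))
```

In this sketch each hourly bucket keeps accumulating records for the full hour, which is why a one-hour rotation gets expensive on a busy topic.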

Also, why do you need this? Spark, Presto, Pig, Hive, etc. can all read multiple files from an upper-level S3 path.

OneCricketeer avatar Dec 01 '17 06:12 OneCricketeer