spark-tfrecord

[Feature Request] Add option to batch data when using SequenceExample

Open utkarshgupta137 opened this issue 3 years ago • 5 comments

It would be great if this library could automatically create batches & save them using SequenceExample. I tried to batch the data myself, but I ran into memory issues when doing so. I think that if batching were handled properly at the partition level, it would be both faster & easier to use.
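To make "handled at the partition level" concrete, here is a minimal sketch in plain Python. It is an illustration only, not part of spark-tfrecord: the helper name `batch_partition` and the dict-per-row layout are assumptions. The function consumes one partition's rows lazily and holds only a single batch in memory at a time, which is the property that would avoid the memory issues described above.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batch_partition(
    rows: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[Dict[str, List[Any]]]:
    """Group the rows of one partition into column-wise batches.

    Hypothetical helper: each input row is a dict of column -> value, and
    each output batch is a dict of column -> list of values.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # Pull at most batch_size rows; only this chunk is held in memory.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Transpose the list of row dicts into one dict of per-column lists.
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


if __name__ == "__main__":
    partition = ({"id": i, "value": i * 2} for i in range(5))
    for batch in batch_partition(partition, batch_size=2):
        print(batch)
```

In PySpark, a generator of this shape could in principle be passed to `mapPartitions`, so the last batch of each partition may be smaller than `batch_size`. How the batched rows would then map onto SequenceExample features is left open here.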

utkarshgupta137 • May 19 '22 05:05

I am curious why batching cannot be done on the user side. I don't see the benefit of doing it inside the converter. Assuming you will feed the examples to training/test/eval, won't TF handle batching automatically?

junshi15 • Jun 03 '22 15:06

The difference in file size between, say, 1000 Example records and a single SequenceExample of 1000 rows is very large (the unbatched data is ~50% larger in my case). As a result, the files take longer to read and write, and they need more memory and disk space.

utkarshgupta137 • Jun 03 '22 15:06

Which Spark operation does batching correspond to? GroupBy? Spark-TFRecord is implemented as a Spark data source (similar to Avro, Parquet, CSV), so it supports most data source options. I don't see batching in Spark's data source API. TFRecordReader does batching; why is that not an option for you?

junshi15 • Jun 05 '22 13:06

Batching can be implemented by adding an index to all the rows & then assigning a batch to each row using batch = index % batch_size. Yes, TFRecordReader supports batching, but the whole point of doing it in Spark is the one I gave in my last comment.
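The index-based assignment can be sketched in plain Python as follows. This is a hedged illustration, not code from the thread: the helper names are hypothetical, and it uses integer division (`index // batch_size`), which puts `batch_size` consecutive rows in each batch. The modulo form quoted above would instead spread the rows over `batch_size` interleaved groups; the thread does not say which of the two is intended, so both are shown.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def batch_ids_by_division(num_rows: int, batch_size: int) -> List[int]:
    """Batch id per row so that each batch holds batch_size consecutive rows."""
    return [index // batch_size for index in range(num_rows)]


def batch_ids_by_modulo(num_rows: int, batch_size: int) -> List[int]:
    """Batch id per row using index % batch_size (yields batch_size groups)."""
    return [index % batch_size for index in range(num_rows)]


def group_rows(rows: Sequence[Any], batch_ids: Sequence[int]) -> List[List[Any]]:
    """Collect rows that share a batch id, ordered by batch id."""
    groups: Dict[int, List[Any]] = defaultdict(list)
    for row, batch_id in zip(rows, batch_ids):
        groups[batch_id].append(row)
    return [groups[batch_id] for batch_id in sorted(groups)]


if __name__ == "__main__":
    rows = ["a", "b", "c", "d", "e"]
    print(group_rows(rows, batch_ids_by_division(len(rows), 2)))
    print(group_rows(rows, batch_ids_by_modulo(len(rows), 2)))
```

In Spark itself, the index would have to come from something like a row-numbering step, and the grouping would involve a shuffle, which is one plausible source of the memory pressure mentioned earlier in the thread.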

utkarshgupta137 • Jun 05 '22 16:06

It's not clear to me how to implement that logic in a Spark data source, which is basically a format converter. Contributions are welcome.

junshi15 • Jun 05 '22 17:06