
GZIP/ZLIB compression for TFRecord files.

thesuperzapper opened this issue 7 years ago · 11 comments

I am not 100% sure how reading compressed input is implemented in TensorFlow, but support for writing compressed TFRecord files would be amazing, as TFRecord is a rather space-inefficient format.

Here are a few references to reading GZIP'd TFRecords in the TensorFlow docs:

https://www.tensorflow.org/versions/r1.3/api_docs/cc/class/tensorflow/ops/fixed-length-record-reader#classtensorflow_1_1ops_1_1_fixed_length_record_reader

https://www.tensorflow.org/api_docs/python/tf/contrib/data/TFRecordDataset

https://www.tensorflow.org/versions/r1.3/api_docs/python/tf/python_io/TFRecordCompressionType
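
For reference, reading a GZIP'd file from Python already seems to work along these lines (a minimal sketch using the TF 1.x API; the file name is illustrative):

import tensorflow as tf

# Options must match the compression the file was written with.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
for record in tf.python_io.tf_record_iterator("data.tfrecord.gz", options=options):
    example = tf.train.Example.FromString(record)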

thesuperzapper avatar Aug 13 '17 23:08 thesuperzapper

Compression is working fine. Here are some snippets that might help anyone looking for examples:

To create a record writer using GZIP compression:

options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
writer = tf.python_io.TFRecordWriter(outFilePath, options=options)

To create a TFRecordDataset to read the compressed files:

dataset = tf.data.TFRecordDataset(filenames=filenames, compression_type='GZIP', buffer_size=buffer_size)
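
Putting those together, a minimal end-to-end sketch (the path and feature name are illustrative):

import tensorflow as tf

options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)

# Write a single tf.train.Example to a GZIP-compressed TFRecord file.
with tf.python_io.TFRecordWriter("out.tfrecord.gz", options=options) as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
    }))
    writer.write(example.SerializeToString())

# Read it back; compression_type must match how the file was written.
dataset = tf.data.TFRecordDataset(["out.tfrecord.gz"], compression_type="GZIP")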

unitive-jim avatar Jan 04 '18 03:01 unitive-jim

Thanks @unitive-jim, but I believe the example you gave covers local writes only. Is there an example of writing compressed TFRecord output via Spark?

aht avatar Jan 30 '18 00:01 aht

I have similar needs: saving Spark DataFrame output as compressed files in TFRecord format. I tried using df.write.format("tensorflow").option("codec", "org.apache.hadoop.io.compress.GzipCodec"), but it looks like this option is being ignored as of now.

rkbansal83 avatar Feb 06 '18 21:02 rkbansal83

@unitive-jim I'm intrigued to see TensorFlow using GZIP for TFRecords. I wonder if using cStringIO for Flask files is necessary, or do TensorFlow Servables handle file uploads like this? In-memory compression came to mind when I decided to prepare for many large file uploads to a server with 16 GB of RAM and an i7-7700.

fenderrex avatar Aug 24 '18 21:08 fenderrex

So Flask to BytesIO to gzip seems feasible? Flask request file to BytesIO, BytesIO to gzip,

then

reader = tf.TFRecordReader()  # the queue-based reader lives in tf, not tf.python_io

But what I want to know is how to pass a BytesIO or gzip object to a TFRecord reader directly, so I don't put wear on my SSD or RAID.
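
For the Flask half, compressing an upload entirely in memory needs only the standard library (a minimal sketch; the route and form-field names are illustrative):

import gzip
import io

from flask import Flask, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    # Read the uploaded file and GZIP it in memory, with no disk writes.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(request.files["file"].read())
    compressed = buf.getvalue()
    return "compressed %d bytes" % len(compressed)

As far as I know, though, the TF 1.x TFRecord readers (tf.TFRecordReader, tf.python_io.tf_record_iterator, tf.data.TFRecordDataset) take file paths rather than Python file objects, so the compressed bytes would still have to land somewhere path-addressable (e.g. a tmpfs/ramdisk) before TensorFlow can read them.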

fenderrex avatar Aug 24 '18 21:08 fenderrex

This has been fixed in this PR: https://github.com/tensorflow/ecosystem/pull/108

bigbear2017 avatar Nov 30 '18 06:11 bigbear2017

I tested PR #108. It works with minor changes. It would be cool to merge it. As it is closed, I can also create a new PR if needed.

fhoering avatar Mar 14 '19 15:03 fhoering

Does PR #108 support Spark's .option() API?

jeisinge avatar Mar 25 '19 20:03 jeisinge

From reading the code, you need to provide these options to the Spark context:

spark.hadoop.mapreduce.output.fileoutputformat.compress: true
spark.hadoop.mapreduce.output.fileoutputformat.compress.codec: org.apache.hadoop.io.compress.GzipCodec

So they will apply to the whole Spark job.

If it is possible (not sure it is), it would take another commit inside the Spark TF connector to pass the option through to TFRecordFileOutputFormat. A PySpark example of setting these options is shown below.
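
For example, with PySpark, something like the following should set the job-wide properties (a sketch; the session setup is illustrative):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# These two Hadoop properties enable GZIP output compression for the
# whole Spark job, as described above.
conf = (SparkConf()
        .set("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
        .set("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
             "org.apache.hadoop.io.compress.GzipCodec"))
spark = SparkSession.builder.config(conf=conf).getOrCreate()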

fhoering avatar Mar 26 '19 07:03 fhoering

FYI, this is fixed: I added a codec option in the Spark-TensorFlow connector. See https://github.com/tensorflow/ecosystem/commit/12d65f29b29a1b5bc975d9c11745b6e67818a6ae

vgod-dbx avatar May 09 '19 17:05 vgod-dbx

use option("codec", "org.apache.hadoop.io.compress.GzipCodec") instead option("compression", "gzip") with spark-tensorflow-connector 1.15 version

bravecharge avatar Jun 30 '20 07:06 bravecharge