GZIP/ZLIB compression for TFRecord files.
I am not 100% sure how reading compressed input is implemented in TensorFlow, but support for writing compressed TFRecord files would be amazing, as TFRecord is a rather space-inefficient format.
Here are a few references to reading GZIP'd TFRecords in the TensorFlow docs:
https://www.tensorflow.org/versions/r1.3/api_docs/cc/class/tensorflow/ops/fixed-length-record-reader#classtensorflow_1_1ops_1_1_fixed_length_record_reader
https://www.tensorflow.org/api_docs/python/tf/contrib/data/TFRecordDataset
https://www.tensorflow.org/versions/r1.3/api_docs/python/tf/python_io/TFRecordCompressionType
Compression is working fine. Here are some snippets that might help anyone looking for examples:
To create a record writer using GZIP compression:
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
writer = tf.python_io.TFRecordWriter(outFilePath, options=options)
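For a fuller picture, here is a minimal end-to-end sketch of the same write path (TF 1.x API, as in the snippet above); the file name and the 'value' feature are made up for illustration:

import tensorflow as tf

# Write one GZIP-compressed record to a hypothetical output file.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
with tf.python_io.TFRecordWriter('example.tfrecord.gz', options=options) as writer:
    # A single tf.train.Example holding one int64 feature.
    example = tf.train.Example(features=tf.train.Features(feature={
        'value': tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
    }))
    writer.write(example.SerializeToString())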
To create a TFRecordDataset to read the compressed files:
dataset = tf.data.TFRecordDataset(filenames=filenames, compression_type='GZIP', buffer_size=buffer_size)
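And a sketch of parsing those compressed records back out (again TF 1.x); the feature spec mirrors the hypothetical 'value' feature from the writer sketch:

# Decode each serialized tf.train.Example into a dict of tensors.
feature_spec = {'value': tf.FixedLenFeature([], tf.int64)}
dataset = tf.data.TFRecordDataset(filenames=['example.tfrecord.gz'],
                                  compression_type='GZIP')
dataset = dataset.map(lambda record: tf.parse_single_example(record, feature_spec))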
Thanks @unitive-jim. I believe the example you gave covers local writes only. Is there an example of writing compressed TFRecord output via Spark?
I have similar needs: saving Spark DataFrame output as compressed files in TFRecord format. I tried using df.write.format("tensorflow").option("codec", "org.apache.hadoop.io.compress.GzipCodec"), but it looks like this option is being ignored as of now.
@unitive-jim I'm intrigued to see TensorFlow using GZIP for TFRecords. I wonder whether using cStringIO for Flask files is necessary, or do TensorFlow servables handle file uploads like this? In-memory compression came to mind when I decided to prep for many large file uploads to a server with 16 GB of RAM and an i7-7700.
So Flask to BytesIO to GZIP seems feasible? Flask request file to BytesIO, BytesIO to GZIP,
then
tf.python_io.TFRecordReader()
But what I want to know is how to pass a BytesIO or GZIP object to a TFRecord directly, so I don't put wear on my SSD or RAID.
This has been fixed in this PR: https://github.com/tensorflow/ecosystem/pull/108
I tested PR #108. It works with minor changes. It would be cool to merge it. As it is closed, I can also create a new PR if needed.
Does PR #108 support Spark's .option() API?
From reading the code, you need to provide these options to the Spark context:
spark.hadoop.mapreduce.output.fileoutputformat.compress: true
spark.hadoop.mapreduce.output.fileoutputformat.compress.codec: org.apache.hadoop.io.compress.GzipCodec
So they will apply to the whole Spark job (see the sketch below).
If it is possible (I'm not sure it is), passing the option through to TFRecordFileOutputFormat would be another commit inside the Spark-TF connector.
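A hedged sketch of setting those properties from PySpark (the session setup is illustrative, not from the thread; it assumes spark-tensorflow-connector is on the classpath):

from pyspark.sql import SparkSession

# These Hadoop properties enable output compression job-wide, as described above.
spark = (SparkSession.builder
         .config('spark.hadoop.mapreduce.output.fileoutputformat.compress', 'true')
         .config('spark.hadoop.mapreduce.output.fileoutputformat.compress.codec',
                 'org.apache.hadoop.io.compress.GzipCodec')
         .getOrCreate())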
FYI, this is fixed: I added a codec option in the Spark-TensorFlow connector. See https://github.com/tensorflow/ecosystem/commit/12d65f29b29a1b5bc975d9c11745b6e67818a6ae
use option("codec", "org.apache.hadoop.io.compress.GzipCodec") instead option("compression", "gzip") with spark-tensorflow-connector 1.15 version