How to use parallel_interleave in TensorFlow
I am reading the benchmarks source code.
The following piece of code is the part that creates a TensorFlow dataset from TFRecord files:
ds = tf.data.TFRecordDataset.list_files(tfrecord_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(tf.data.TFRecordDataset, cycle_length=10))
I am trying to change this code to create a dataset directly from JPEG image files:
ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(?, cycle_length=10))
I don't know what to write in the ? place. For TFRecord files, the map_func in parallel_interleave() is the tf.data.TFRecordDataset constructor, but I don't know what to write for JPEG files.
We don't need to do any transformations here, because we will zip two datasets and then do the transformations later. The code is as follows:
counter = tf.data.Dataset.range(batch_size)
ds = tf.data.Dataset.zip((ds, counter))
ds = ds.apply(
    batching.map_and_batch(
        map_func=preprocess_fn,
        batch_size=batch_size,
        num_parallel_batches=num_splits))
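For reference, here is a minimal sketch of what preprocess_fn might look like in this pipeline (assuming the parallel_interleave stage yields raw JPEG bytes; the decode and resize steps and the 224x224 size are just examples):

def preprocess_fn(image_bytes, counter):
    # map_and_batch unpacks the zipped (image_bytes, counter) element.
    # Decode the raw JPEG bytes and resize to a fixed size (example values).
    image = tf.image.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, counter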
Because we don't need a transformation in the ? place, I tried to use an empty map_func, but that raises the error "`map_func` must return a `Dataset` object". I also tried to use tf.data.Dataset itself, but the output says Dataset is an abstract class that is not allowed there.
I asked the same question on Stack Overflow, but no one answered it. I think the developers here can help with this issue. Thanks very much.
A couple of things you may want to look at, which you likely already saw, are linked below. I want to stress that my answer is rough: I did some research and I think some of this information might be useful, but I could not find an exact answer, as this feature is new and we almost always use TFRecords. I did not try to run any of the code I typed out here; my thought is that my answer is better than nothing, and even if wrong it might still help, as I think I am pointing in the right direction.
- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/performance/datasets_performance.md
- https://www.tensorflow.org/programmers_guide/datasets
- https://www.tensorflow.org/api_docs/python/tf/contrib/data/parallel_interleave
Some other high-level thoughts:
- The value of parallel_interleave is processing chunks of files, e.g. RecordIO. You might still see some value in doing this with individual JPEGs, but you would likely see more value from making the JPEGs into TFRecords bundled into roughly 100 MB files (see the rough TFRecord-writing sketch at the end of this comment).
- You want to include the tf.read_file as part of tf.data inside the parallel_interleave, and I am pretty sure the function needs to return a dataset, which makes sense because the tf.data.TFRecordDataset constructor returns a dataset:
# Reads an image from a file.
def _parse_function(filename):
    image_string = tf.read_file(filename)
    # You could decode it here as well, but that might not be optimal, as this
    # setup is about reading the file. You might also need to make this a list of one.
    return tf.data.Dataset.from_tensors(image_string)

# Gets a dataset with the file names.
ds_filenames = tf.data.Dataset.list_files("/path/to/data/*.jpeg")

# Indicate you want to read the files in parallel.
dataset = ds_filenames.apply(tf.contrib.data.parallel_interleave(
    # Possibly you do not need the lambda and could just pass _parse_function.
    lambda filename: _parse_function(filename),
    cycle_length=10))
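If it helps, an untested way to sanity-check the resulting dataset (TF 1.x style; the loop count is arbitrary) would be:

# Pull a few raw JPEG strings out of the dataset to verify the pipeline runs.
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for _ in range(3):
        jpeg_bytes = sess.run(next_element)
        print(len(jpeg_bytes))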
The benchmark team almost always reads TFRecords, so we are not necessarily experts on tf.data, but I did not want to leave you without an answer. I cannot promise I will answer a follow-up, but I will try. I also suggest linking to a full example on GitHub that I or someone else can run rather than posting snippets; that is much easier to tweak. With running code, the chances of getting you an answer go up 10x compared to snippets, where I have to set up my own example to verify the concept if I cannot find example code.
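On the first bullet above, a rough, untested sketch of bundling JPEGs into a TFRecord file (the output path, the jpeg_file_names list, and the image/encoded feature key are all just example names) could look like:

import tensorflow as tf

# Write the raw JPEG bytes of each file into a single TFRecord file.
with tf.python_io.TFRecordWriter("/path/to/data/images-00000.tfrecord") as writer:
    for jpeg_path in jpeg_file_names:  # hypothetical list of JPEG paths
        with open(jpeg_path, "rb") as f:
            jpeg_bytes = f.read()
        example = tf.train.Example(features=tf.train.Features(feature={
            "image/encoded": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        }))
        writer.write(example.SerializeToString())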
Hello, I deal with it using a generator; I don't know if it will penalize the speed.
import h5py

class generator_yield:
    def __init__(self, file):
        self.file = file

    def __call__(self):
        with h5py.File(self.file, 'r') as f:
            yield f['X'][:], f['y'][:]
You could customize the h5py part with Pillow/pandas.
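For JPEGs, a rough Pillow-based variant of the generator (an untested sketch; generator_yield_jpeg and the float32 dtype are just example choices) might be:

from PIL import Image
import numpy as np

class generator_yield_jpeg:
    # Same idea as above, but decoding a JPEG with Pillow instead of h5py.
    def __init__(self, file):
        self.file = file

    def __call__(self):
        yield np.asarray(Image.open(self.file), dtype=np.float32)

Then create a list of JPEG paths and pass it into tf.data.Dataset: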
fnames = tf.data.Dataset.from_tensor_slices(fnames)
Finally, apply parallel_interleave:
with tf.Session() as sess:
    batches = fnames.apply(tf.data.experimental.parallel_interleave(
        lambda filename: tf.data.Dataset.from_generator(
            generator_yield(filename),
            # The generator yields (X, y), so the types and shapes are tuples;
            # y's shape is left unknown here.
            output_types=(tf.float32, tf.float32),
            output_shapes=(tf.TensorShape([img_size, img_size]),
                           tf.TensorShape(None))),
        cycle_length=len_fnames))
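To actually pull elements out inside that session, an untested continuation might be:

    # Continuing inside the `with tf.Session() as sess:` block above.
    iterator = batches.make_one_shot_iterator()
    next_element = iterator.get_next()
    X_val, y_val = sess.run(next_element)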