ray_shuffling_data_loader
[feature] support TensorFlow dataset binding in ray_data_loader
We need to build a connector to TF dataset iterator.
Implementation idea from @clarkzinzow: take the base shuffling dataset, create a ShufflingTFDataset that converts each batch dataframe into feature and target tensors, and then pass that ShufflingTFDataset as the generator to tf.data.Dataset.from_generator, which can then be used as a typical TensorFlow dataset:
# Note: from_generator expects a callable generator, and also needs an
# output_signature (or output_types) describing the (features, targets) tensors.
ds = tf.data.Dataset.from_generator(
    ShufflingTFDataset(filenames, num_epochs, num_trainers, batch_size, dataframe_to_tensor_spec)
)
for batch_idx, (features, targets) in enumerate(ds):
    print(f"Processing batch {batch_idx}!")
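To make the generator side concrete, here is a minimal sketch of what a ShufflingTFDataset wrapper could look like. This is an assumption, not the actual implementation: the class name matches the proposal above, but the batch source, column names, and the `to_tensors` helper are invented stand-ins (plain Python lists stand in for TensorFlow tensors so the sketch runs without TF):

```python
# Hypothetical sketch; the real shuffling dataset and conversion helper
# are assumptions, not the library's API.

class ShufflingTFDataset:
    """Wraps a batch source and yields (features, targets) pairs.

    tf.data.Dataset.from_generator expects a *callable* that returns an
    iterator, so the class implements __call__ rather than __iter__.
    """

    def __init__(self, batches, dataframe_to_tensors):
        # `batches` stands in for the underlying shuffling dataset, which
        # produces one dataframe per batch; `dataframe_to_tensors` splits a
        # batch into (features, targets).
        self.batches = batches
        self.dataframe_to_tensors = dataframe_to_tensors

    def __call__(self):
        for batch_df in self.batches:
            yield self.dataframe_to_tensors(batch_df)


# Toy stand-ins: batches are dicts of columns; "tensors" are plain lists.
def to_tensors(batch):
    return batch["feature"], batch["target"]


ds = ShufflingTFDataset(
    [{"feature": [1.0, 2.0], "target": [0, 1]},
     {"feature": [3.0, 4.0], "target": [1, 0]}],
    to_tensors,
)
pairs = list(ds())  # each element is a (features, targets) pair
```

In real code, `ds` would be handed to tf.data.Dataset.from_generator along with an output_signature built from the dataframe schema.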
I can’t see any obvious issues with doing this, except for mapping TensorFlow’s distributed dataset + data-parallel training paradigms onto our current rank-based shuffling dataset, where we kick off the shuffle from the rank-0 training worker and give each worker an independent queue of batches. The latter should be doable via getting the replica ID from tf.distribute.get_replica_context() during iteration and using that to access the correct queue, but the former paradigm may need to be tweaked.
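The rank-to-queue mapping described above can be sketched as follows. This is a simplified assumption of the design: in real TF code the replica ID would come from tf.distribute.get_replica_context().replica_id_in_sync_group during iteration, whereas here it is passed in explicitly so the sketch runs standalone:

```python
# Sketch of the per-worker queue access pattern; queue contents and the
# explicit replica_id argument are stand-ins for illustration.
from queue import Queue

NUM_TRAINERS = 2

# One independent queue of batches per training worker, as in the current
# rank-based shuffling dataset.
batch_queues = [Queue() for _ in range(NUM_TRAINERS)]
for rank, q in enumerate(batch_queues):
    q.put(f"batch-for-rank-{rank}")


def next_batch(replica_id):
    # Each replica reads only from its own queue, so workers never
    # consume each other's shuffled batches.
    return batch_queues[replica_id].get()


b0 = next_batch(0)
b1 = next_batch(1)
```

The open question flagged above is the other direction: who kicks off the shuffle, since TF's distributed paradigm has no built-in notion of a rank-0 worker driving a shared shuffle.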
@clarkzinzow could you add some unit/integration tests + dev guide to the code base? it is currently a bit hard for external people to iterate.
Also, it would be better to provide a raw PyTorch distributed training MNIST example; Horovod makes the current example even harder to understand.
Hi @oliverhu, thank you for opening this feature request!
> @clarkzinzow could you add some unit/integration tests + dev guide to the code base? it is currently a bit hard for external people to iterate.
We can definitely do that! We're also doing some internal knowledge transfer for this data loader next week, so I should be able to allocate some time at the beginning of next week to add a test suite and a development guide. I'll ping you once those are added!
> Also better to provide a raw pytorch distributed training mnist example, horovod is making the example even harder to understand..
That's a great point! I'll look at adding a plain PyTorch example as well.