ray_shuffling_data_loader

[feature] support TensorFlow dataset binding in ray_data_loader

Open oliverhu opened this issue 4 years ago • 2 comments

We need to build a connector to TF dataset iterator.

Impl idea from @clarkzinzow: We’d take the base shuffling dataset, create a ShufflingTFDataset that converts each batch dataframe to feature and target tensors, and then pass that ShufflingTFDataset as the generator to tf.data.Dataset.from_generator, which can then be used as your typical TensorFlow dataset:

shuffling_ds = ShufflingTFDataset(filenames, num_epochs, num_trainers, batch_size, dataframe_to_tensor_spec)
# from_generator expects a callable that returns the generator, plus the
# specs of the yielded tensors (features_spec/targets_spec would be derived
# via dataframe_to_tensor_spec).
ds = tf.data.Dataset.from_generator(
    lambda: iter(shuffling_ds),
    output_signature=(features_spec, targets_spec))
for batch_idx, (features, targets) in enumerate(ds):
    print(f"Processing batch {batch_idx}!")
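For reference, `from_generator` only requires its generator argument to produce an iterable of `(features, targets)` pairs. A pure-Python stand-in (no TensorFlow needed; the class and parameter names here are illustrative, not the actual loader API) shows the iteration protocol `ShufflingTFDataset` would need to implement:

```python
import random

class ShufflingTFDatasetStub:
    """Illustrative stand-in: yields (features, targets) batch pairs the
    way the proposed ShufflingTFDataset would. All names are hypothetical."""

    def __init__(self, num_batches, batch_size, num_features, seed=0):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.num_features = num_features
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        for _ in range(self.num_batches):
            # Each yielded element is one batch: a features matrix of
            # shape (batch_size, num_features) and a targets vector.
            features = [[rng.random() for _ in range(self.num_features)]
                        for _ in range(self.batch_size)]
            targets = [rng.randint(0, 1) for _ in range(self.batch_size)]
            yield features, targets

ds = ShufflingTFDatasetStub(num_batches=3, batch_size=4, num_features=2)
for batch_idx, (features, targets) in enumerate(ds):
    print(f"Processing batch {batch_idx}!")
```

In the real connector, the `__iter__` body would pull shuffled batch dataframes from the loader's queue and convert them to tensors instead of generating random data.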

I can’t see any obvious issues with doing this, except for mapping TensorFlow’s distributed dataset + data-parallel training paradigms onto our current rank-based shuffling dataset, where we kick off the shuffle from the rank-0 training worker and give each worker an independent queue of batches. The latter should be doable via getting the replica ID from tf.distribute.get_replica_context() during iteration and using that to access the correct queue, but the former paradigm may need to be tweaked.
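The per-worker queue scheme described above can be sketched in pure Python. This is a simplified round-robin assignment under stated assumptions (the real loader fills queues asynchronously from the shuffle, and `batches_for_replica` stands in for a lookup keyed by the TensorFlow replica ID; both function names are hypothetical):

```python
from collections import deque

def build_rank_queues(batches, num_trainers):
    """Deal shuffled batches round-robin into one independent queue per
    trainer rank, mirroring the rank-based scheme described above."""
    queues = [deque() for _ in range(num_trainers)]
    for i, batch in enumerate(batches):
        queues[i % num_trainers].append(batch)
    return queues

def batches_for_replica(queues, replica_id):
    """Look up this worker's queue. Under tf.distribute, replica_id would
    come from tf.distribute.get_replica_context().replica_id_in_sync_group."""
    return queues[replica_id]

queues = build_rank_queues(batches=list(range(10)), num_trainers=4)
print([list(q) for q in queues])
# Each rank then iterates only its own queue:
print(list(batches_for_replica(queues, 2)))
```

The point of the sketch is that iteration on each replica touches only that replica's queue, so no cross-worker coordination is needed once the rank-0 worker has kicked off the shuffle.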

oliverhu avatar Jun 24 '21 03:06 oliverhu

@clarkzinzow could you add some unit/integration tests + a dev guide to the code base? It is currently a bit hard for external people to iterate.

It would also be better to provide a raw PyTorch distributed training MNIST example; Horovod makes the current example even harder to understand.

oliverhu avatar Jun 25 '21 05:06 oliverhu

Hi @oliverhu, thank you for opening this feature request!

> @clarkzinzow could you add some unit/integration tests + a dev guide to the code base? It is currently a bit hard for external people to iterate.

We can definitely do that! We're also doing some internal knowledge transfer for this data loader next week, so I should be able to allocate some time at the beginning of next week to add a test suite and a development guide. I'll ping you once those are added!

> It would also be better to provide a raw PyTorch distributed training MNIST example; Horovod makes the current example even harder to understand.

That's a great point! I'll look at adding a plain PyTorch example as well.

clarkzinzow avatar Jun 25 '21 14:06 clarkzinzow