[BUG] Fixing Data Parallel Training
Bug description
https://github.com/NVIDIA-Merlin/models/issues/964 demonstrated an issue with our data parallel training in Merlin Models.
- What is the correct data loader to use? We have https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/loader.py and we also have https://github.com/NVIDIA-Merlin/dataloader. Which one should be used?
- The Merlin Models dataloader overwrites the global_rank variable (https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/loader.py#L307-L311). What is the purpose of accepting it as a parameter if it is always overwritten? I think we should change the if statement to check global_rank is not None (and the same for global_size); see the sketch below this list.
- It seems that https://github.com/NVIDIA-Merlin/dataloader does not have this behavior. Is that correct, and why is that the case?
- Have we fixed the different number of batches across workers? In the example https://github.com/NVIDIA-Merlin/models/blob/main/examples/usecases/multi-gpu-data-parallel-training.ipynb, we use the workaround of splitting the dataset BEFORE training into <# GPUs> equal parts (a sketch of this workaround follows below the list) and ignore the dataloader's capability to split the dataset, because it did not guarantee that all dataloader workers receive the same number of batches and the training therefore froze - see https://github.com/NVIDIA-Merlin/Merlin/issues/752
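To make the global_rank/global_size point concrete, here is a minimal sketch of the suggested guard. It assumes the loader currently pulls rank and size from Horovod unconditionally; the helper name below is hypothetical and not the actual loader code.

```python
import horovod.tensorflow as hvd


def _resolve_sharding(global_rank=None, global_size=None):
    """Hypothetical helper: keep caller-provided values, fall back to Horovod."""
    # Only consult Horovod when the caller did not pass an explicit value,
    # instead of overwriting whatever was passed in.
    if global_rank is None:
        global_rank = hvd.rank()
    if global_size is None:
        global_size = hvd.size()
    return global_rank, global_size
```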
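And for context on the batch-count workaround, a rough sketch of pre-splitting the data into one partition per GPU. The paths and NUM_GPUS are placeholders, not the notebook's exact code.

```python
import merlin.io

NUM_GPUS = 2  # placeholder: number of Horovod workers
dataset = merlin.io.Dataset("train/*.parquet")  # placeholder path

# Rebalance into exactly one dask partition per worker, then write each
# partition to its own directory so every worker reads a roughly equal slice.
ddf = dataset.to_ddf().repartition(npartitions=NUM_GPUS)
for rank in range(NUM_GPUS):
    ddf.partitions[rank].to_parquet(f"train_split/rank_{rank}/")
```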
@bschifferer Why did we decide to split the dataset in the multi-gpu example instead of repartitioning the dataset with row_group_size? Did repartitioning not work?
@edknv I don't think repartitioning is an option. If you have a 1TB dataset, does that work? How long would repartitioning take?
Does repartitioning guarantee equal-size datasets?
> I don't think repartitioning is an option. If you have a 1TB dataset, does that work? How long would repartitioning take?
We are using dask dataframes, so the work is out of core and I don't think the size of the dataset matters. Repartitioning should be pretty fast too; I haven't tested it, but I would guess on the order of a few hundred milliseconds for a 1TB dataset.
> Does repartitioning guarantee equal-size datasets?
I don't think there is such a guarantee, but I believe using df.to_parquet(..., row_group_size=...) together with repartitioning is the best option we have until there is a more robust solution on the core or dataloader side.
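A rough sketch of that approach, with placeholder paths and sizes. I'm assuming row_group_size is forwarded by dask's to_parquet to the underlying parquet writer, as in the pandas call above, so double-check that against the installed dask version.

```python
import dask.dataframe as dd

# Read out of core and rebalance into a fixed number of partitions
# (placeholder paths and partition count).
ddf = dd.read_parquet("train/*.parquet")
ddf = ddf.repartition(npartitions=8)

# Smaller, uniform row groups give the dataloader finer-grained units to
# shard across workers, which makes equal batch counts more likely.
# row_group_size is assumed to be passed through to the parquet engine.
ddf.to_parquet("train_repartitioned/", row_group_size=100_000)
```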