[BUG] Fixing Data Parallel Training
Bug description
https://github.com/NVIDIA-Merlin/models/issues/964 demonstrated an issue with our data parallel training in Merlin Models.
- What is the correct data loader to use? We have https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/loader.py and we also have https://github.com/NVIDIA-Merlin/dataloader. Which one should be used?
- The Merlin Models dataloader overwrites the global_rank variable (https://github.com/NVIDIA-Merlin/models/blob/main/merlin/models/tf/loader.py#L307-L311). What is the purpose of accepting it as a parameter if it is always overwritten? I think we should change the if statement to check global_rank is not None (and the same for global_size); see the sketch below this list.
- It seems that https://github.com/NVIDIA-Merlin/dataloader does not have this behavior. Is that correct, and why is that the case?
- Have we fixed the different number of batches across workers? In the example https://github.com/NVIDIA-Merlin/models/blob/main/examples/usecases/multi-gpu-data-parallel-training.ipynb, we use the workaround of splitting the dataset BEFORE training into <# GPUs> equal parts (a sketch of this workaround follows below the list) and ignore the dataloader's capability to split the dataset, because it did not guarantee that all dataloader workers receive the same number of batches and the training therefore froze - see https://github.com/NVIDIA-Merlin/Merlin/issues/752
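To make the global_rank/global_size point concrete, here is a minimal sketch of the suggested guard. It assumes the loader currently pulls rank and size from Horovod unconditionally; the helper name below is hypothetical and not the actual loader code.

```python
import horovod.tensorflow as hvd


def _resolve_sharding(global_rank=None, global_size=None):
    """Hypothetical helper: keep caller-provided values, fall back to Horovod."""
    # Only consult Horovod when the caller did not pass an explicit value,
    # instead of overwriting whatever was passed in.
    if global_rank is None:
        global_rank = hvd.rank()
    if global_size is None:
        global_size = hvd.size()
    return global_rank, global_size
```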
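And for context on the batch-count workaround, a rough sketch of pre-splitting the data into one partition per GPU. The paths and NUM_GPUS are placeholders, not the notebook's exact code.

```python
import merlin.io

NUM_GPUS = 2  # placeholder: number of Horovod workers
dataset = merlin.io.Dataset("train/*.parquet")  # placeholder path

# Rebalance into exactly one dask partition per worker, then write each
# partition to its own directory so every worker reads a roughly equal slice.
ddf = dataset.to_ddf().repartition(npartitions=NUM_GPUS)
for rank in range(NUM_GPUS):
    ddf.partitions[rank].to_parquet(f"train_split/rank_{rank}/")
```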
@bschifferer Why did we decide to split the dataset in the multi-gpu example instead of repartitioning the dataset with row_group_size? Did repartitioning not work?
@edknv I don't think repartitioning is an option. If you have a 1TB dataset, does that work? How long would repartitioning take?
Does repartitioning guarantee equal-size datasets?
> I don't think repartitioning is an option. If you have a 1TB dataset, does that work? How long would repartitioning take?
We are using dask dataframes, so the work is out of core and I don't think the size of the dataset matters. Repartitioning should be pretty fast too; I haven't tested it, but I would guess on the order of a few hundred milliseconds for a 1TB dataset.
> Does repartitioning guarantee equal-size datasets?
I don't think there is such a guarantee, but I believe using df.to_parquet(..., row_group_size=...) together with repartitioning is the best option we have until there is a more robust solution on the core or dataloader side.
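A rough sketch of that approach, with placeholder paths and sizes. I'm assuming row_group_size is forwarded by dask's to_parquet to the underlying parquet writer, as in the pandas call above, so double-check that against the installed dask version.

```python
import dask.dataframe as dd

# Read out of core and rebalance into a fixed number of partitions
# (placeholder paths and partition count).
ddf = dd.read_parquet("train/*.parquet")
ddf = ddf.repartition(npartitions=8)

# Smaller, uniform row groups give the dataloader finer-grained units to
# shard across workers, which makes equal batch counts more likely.
# row_group_size is assumed to be passed through to the parquet engine.
ddf.to_parquet("train_repartitioned/", row_group_size=100_000)
```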