
[RMP] Multi-GPU Data Parallel training for Tensorflow

Open · viswa-nvidia opened this issue 2 years ago

Problem:

Single-GPU training takes significantly longer than multi-GPU training. Customers would like to accelerate their training workflows by distributing training across multiple GPUs on a single node.

Goal:

Enable customers to do data-parallel training within the Merlin Models training pipeline.

Constraints:

  • Single node
  • Embedding tables fit within the memory of a single GPU
  • Use NVIDIA best practices, i.e. Horovod

Starting Point:

  • Enable Horovod within the Merlin Models API (a generic sketch of the Horovod pattern is included below)
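
For context, here is a minimal sketch of the generic Horovod/Keras data-parallel pattern that this starting point would build on. The model, optimizer, and data setup are illustrative placeholders, not the actual Merlin Models API:

```python
# Generic Horovod + Keras data-parallel pattern (not the Merlin Models API itself).
# The model and optimizer choices below are illustrative placeholders.
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Pin each worker process to a single GPU on the node.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Linear-scaling heuristic: scale the learning rate by the number of workers.
opt = tf.keras.optimizers.Adam(learning_rate=1e-3 * hvd.size())
# Wrap the optimizer so gradients are averaged across workers each step.
opt = hvd.DistributedOptimizer(opt)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])  # placeholder model
model.compile(optimizer=opt, loss="binary_crossentropy")

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Each worker then calls model.fit(...) on its own shard of the data,
# logging only on rank 0, e.g.:
# model.fit(train_shard, epochs=1, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```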

viswa-nvidia · Aug 10, 2022

@marcromeyn, can you flesh this out a little further?

EvenOldridge · Aug 31, 2022

@viswa-nvidia to follow up with the DLFW team regarding native support.

viswa-nvidia · Sep 26, 2022

@viswa-nvidia @EvenOldridge

We need to add the following success criteria:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy / AUC / etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, per-GPU batch size, and learning rate when scaling (a sketch of the common convention follows below)

If we provide only the technical functionality without testing the points above, we cannot guarantee that it works. I get these questions from customers often.
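
As a concrete starting point for the last bullet, here is a sketch of the common convention for relating per-GPU batch size, global batch size, and learning rate. The linear-scaling rule is a heuristic assumption, not validated Merlin guidance, and usually needs a warmup schedule:

```python
# Common heuristic for scaling hyperparameters with the number of workers.
# The numbers are placeholders; the linear-scaling rule should be validated.
import horovod.tensorflow.keras as hvd

hvd.init()

per_gpu_batch_size = 8192                                # batch seen by each worker per step
global_batch_size = per_gpu_batch_size * hvd.size()      # effective batch size per step
base_learning_rate = 1e-3                                # tuned for single-GPU training
scaled_learning_rate = base_learning_rate * hvd.size()   # usually combined with LR warmup

if hvd.rank() == 0:
    print(f"workers={hvd.size()} global_batch={global_batch_size} lr={scaled_learning_rate}")
```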

bschifferer · Sep 29, 2022

I provided an example to show that Merlin Models works with Horovod: https://github.com/NVIDIA-Merlin/models/pull/778

However, we still need to address the points above plus the bug (https://github.com/NVIDIA-Merlin/dataloader/issues/75).

In addition, we should make it more user-friendly.

bschifferer · Sep 29, 2022

Noted. @EvenOldridge, please review and add to the goals. I am not sure this ticket is fully defined.

viswa-nvidia · Oct 6, 2022

I've been experimenting with Horovod integration in the Models API, based on @bschifferer's example https://github.com/NVIDIA-Merlin/models/pull/778. I fully agree with all the success criteria he listed above, and also with the need for the dataloader to produce an equal number of batches across partitions, as mentioned in https://github.com/NVIDIA-Merlin/dataloader/issues/75.

Some additional notes on https://github.com/NVIDIA-Merlin/dataloader/issues/75: the dataloader produces an unequal number of batches when the dataset is partitioned. This is problematic for Horovod because one worker may finish all of its batches and then sit idle, hang, or time out while the other worker(s) are still processing theirs. There are possible workarounds, such as seeding on the dataloader side (as mentioned in the issue) or calling hvd.join() on the Horovod side, but the best solution is for the dataloader to produce an equal number of batches when partitioned.
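
Until the dataloader fix lands, one possible mitigation is to cap every rank at the minimum batch count across ranks. This is only a sketch, assuming eager TensorFlow and a loader that supports len(); it is not the dataloader fix itself:

```python
# Sketch of a possible mitigation, not the dataloader fix: every rank iterates
# only as many batches as the rank with the fewest batches has, so no worker
# blocks waiting on batches that other ranks never produce.
# Assumes the loader supports len() and that TF eager mode is active.
import itertools

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

def equalized_batches(loader):
    """Yield batches, truncated to the minimum per-rank batch count."""
    local_count = tf.constant([len(loader)], dtype=tf.int32)
    all_counts = hvd.allgather(local_count)          # one count per rank
    min_count = int(tf.reduce_min(all_counts).numpy())
    return itertools.islice(iter(loader), min_count)
```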

edknv · Oct 6, 2022

I provided the following example based on the current code, as I am OOO next week: https://github.com/NVIDIA-Merlin/models/pull/855

@edknv did a great job providing the Horovod functionality in Merlin Models.

I think we need to review the current multi-GPU flow; it is not yet a fully user-friendly, end-to-end integration:

  • I am not sure whether the issue of unequal batch counts in the dataloader is solved: https://github.com/NVIDIA-Merlin/dataloader/issues/75 - if the solution is to generate the data correctly upstream, how does that work? How do we ensure it with NVTabular? What about users who do NOT use NVTabular?
  • The unit test is written so that each worker runs through the FULL dataset per epoch. That is incorrect: if we have 1M data points and 2 GPUs, each GPU should run through only 500k data points (see the sharding sketch below). I wrote the example so that NVTabular produces distinct files per worker; however, that is not a solution that guarantees the point above.
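
To illustrate the sharding point, here is a sketch under the assumption of file-level round-robin sharding over Parquet files; `build_loader` is a hypothetical placeholder for whatever loader construction the Models API settles on:

```python
# Illustrative per-worker sharding so each of N workers sees ~1/N of the data
# per epoch. File-level round-robin sharding is only one option; the dataloader
# could equally shard internally given the worker's rank and the world size.
import glob

import horovod.tensorflow as hvd

hvd.init()

per_gpu_batch_size = 8192                              # placeholder value
all_files = sorted(glob.glob("train/*.parquet"))       # assumed data layout
# Round-robin assignment: worker k gets files k, k+N, k+2N, ...
worker_files = all_files[hvd.rank()::hvd.size()]

# With 1M rows across the files and 2 GPUs, each GPU now iterates over roughly
# 500k rows per epoch. `build_loader` is a hypothetical placeholder:
# train_loader = build_loader(worker_files, batch_size=per_gpu_batch_size)
```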

We haven't looked at these points yet:

  • Analyze the scaling factor when using multiple GPUs: if we go from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
  • Provide performance metrics (accuracy / AUC / etc.) to show that there is no negative effect on model performance
  • Provide guidance on how to set the global batch size, per-GPU batch size, and learning rate when scaling

bschifferer · Nov 3, 2022

@bschifferer, please create a separate RMP ticket for the multi-GPU enhancements.

viswa-nvidia · Nov 15, 2022

@viswa-nvidia, I don't think the tasks under Starting Point and Examples are the relevant items. Can you please follow up with @edknv and @bschifferer to capture the tasks that enabled this roadmap ticket? I think they are here: https://github.com/orgs/NVIDIA-Merlin/projects/6/views/34?filterQuery=RMP536-DATA+PARALLEL+MULTI-GPU

jsohn-nvidia · Dec 16, 2022