
[REQUEST] Distributed data parallel training


Currently, Betty only supports torch.nn.DataParallel. Compared to torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel is much slower even in single-machine multi-GPU settings. Therefore, we should replace torch.nn.DataParallel with torch.nn.parallel.DistributedDataParallel for better training speed and multi-machine multi-GPU support.
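For context, a rough sketch of the difference between the two wrappers on a plain PyTorch model (illustrative only; the model here is a placeholder, not Betty code):

import torch.nn as nn

model = nn.Linear(10, 1).cuda()

# torch.nn.DataParallel: a single process that replicates the model on every
# forward pass and gathers outputs on GPU 0, which becomes a bottleneck.
dp_model = nn.DataParallel(model)

# torch.nn.parallel.DistributedDataParallel: one process per GPU, with the
# gradient all-reduce overlapped with the backward pass; also works across
# machines. Requires the process group to be initialized first, e.g.
# torch.distributed.init_process_group(backend="nccl")
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])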

sangkeun00 avatar Jul 01 '22 15:07 sangkeun00

can't agree more!

gongbudaizhe avatar Jul 12 '22 09:07 gongbudaizhe

The main issue is that highly efficient gradient synchronization (all-reduce) of DistributedDataParallel only works with torch.autograd.backward. However, meta-learning/MLO heavily uses torch.autograd.grad instead of torch.autograd.backward.

An ad-hoc solution is to perform gradient synchronization manually; however, this may degrade throughput because we lose optimizations like computation-communication overlap (register_hook is also not supported for torch.autograd.grad :disappointed: ). This may not be a big issue if your setting is single-machine multi-GPU with very high communication bandwidth (e.g., NVLink), or if your model is not very large (probably anything smaller than BERT).
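To make the idea concrete, here is a rough sketch of what such manual synchronization could look like (illustrative only, not Betty's actual implementation; it assumes torch.distributed has already been initialized):

import torch
import torch.distributed as dist

def compute_synced_grads(loss, params):
    # Each process computes its local gradients with torch.autograd.grad;
    # unlike loss.backward() under DDP, no all-reduce hooks fire here.
    grads = torch.autograd.grad(loss, params)
    # Average the gradients across all workers. This happens only after the
    # full backward computation, so there is no computation-communication
    # overlap.
    world_size = dist.get_world_size()
    synced = []
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)
        synced.append(g / world_size)
    return tuple(synced)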

Therefore, I may implement this ad-hoc option soon, and once PyTorch supports efficient synchronization for torch.autograd.grad, we may update our strategy. If you have any questions or are willing to work on this feature, feel free to let me know!

Best, Sang

sangkeun00 avatar Jul 12 '22 13:07 sangkeun00

I understand the difficulty of implementing this feature, but in my opinion, this is the single most important feature that Betty should have, and one that similar repos lack (see https://github.com/facebookresearch/higher/issues/116). To make GML/MLO more impactful, it should be applied to large-scale problems in the wild. Distributed training is a must-have for real applications.

The ad-hoc option looks good to me, can't wait to try it 👍

gongbudaizhe avatar Jul 13 '22 00:07 gongbudaizhe

Hello @gongbudaizhe,

I apologize for the late reply! I finally implemented the distributed training feature for the multi-node/multi-GPU setting. To try this feature, install the nightly version by cloning the most recent commit and installing from source:

git clone https://github.com/leopard-ai/betty.git
cd betty
pip install .

In detail, you can enable distributed training by 1) setting distributed=True in EngineConfig as:

engine_config=EngineConfig(distributed=True, ...)

and 2) launching the training script with torch.distributed.launch on every node, as in standard PyTorch distributed training (a sketch of the launch command follows below).
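For example, on a single node with 4 GPUs, the launch command could look roughly like the following (the script name is a placeholder; for multi-node runs, torch.distributed.launch also takes flags such as --nnodes, --node_rank, --master_addr, and --master_port):

python -m torch.distributed.launch --nproc_per_node=4 train.py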

In the future, we plan to simplify the launching procedure, similar to Microsoft's DeepSpeed or Hugging Face's Accelerate. To do this, we need to write a custom launcher. If you are willing to contribute to this, that would be very welcome!

I am sorry again for the delay, and let me know if you have any questions!

Best, Sang

sangkeun00 avatar Oct 25 '22 03:10 sangkeun00