
Distributed Training

Open • Het-Shah opened this issue 4 years ago • 5 comments

We need to add support for distributed training; for now we can make direct use of PyTorch DDP. Let me know if anyone wants to take this up.
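For anyone picking this up, here is a minimal sketch of how the student's training loop could look with DDP. This is illustrative only: the function, model, and loader names are placeholders, the soft-target loss is the standard Hinton-style formulation, and nothing here is existing KD_Lib API.

```python
# Rough DDP sketch for a KD training loop (placeholder names, not KD_Lib API).
# Assumes MASTER_ADDR / MASTER_PORT are set in the environment (e.g. by a launcher).
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train_student(rank, world_size, teacher, student, loader, epochs=1, T=4.0):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    teacher = teacher.to(rank).eval()                   # teacher is frozen, so no DDP wrap
    student = DDP(student.to(rank), device_ids=[rank])  # gradients all-reduced across ranks
    optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

    for _ in range(epochs):
        for x, y in loader:                             # loader should use a DistributedSampler
            x, y = x.to(rank), y.to(rank)
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            # Hinton-style soft-target KD loss plus hard-label cross-entropy
            kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                          F.softmax(t_logits / T, dim=1),
                          reduction="batchmean") * T * T
            loss = kd + F.cross_entropy(s_logits, y)
            optimizer.zero_grad()
            loss.backward()                             # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()
```

One process per GPU would then be started with `torch.multiprocessing.spawn` or one of the standard launchers.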

Het-Shah avatar Aug 28 '20 08:08 Het-Shah

I wouldn't mind taking this up, but I'd need a little time to do it. Let me know if that works.

avishreekh avatar Sep 01 '20 10:09 avishreekh

Yeah, take your time; we don't really need to release this immediately anyway.

Het-Shah avatar Sep 01 '20 10:09 Het-Shah

Hi @Het-Shah and @avishreekh, thanks for creating this wonderful library with support for multiple KD algorithms. The code and implementation are nicely done and well structured.

Wanted to know if there is any update on distributed training. Currently, if I run `python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 vanilla_kd.py`, the library does not run. Multi-GPU training is crucial for this library to be really useful: both model sizes and dataset sizes are increasing, and we cannot get away from training on multiple GPUs.
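In case it helps, this is the launcher-side plumbing I would expect the entry point to need. The `nn.Linear` models and random tensors below are just stand-ins so the snippet runs as written; this is generic PyTorch wiring, not KD_Lib internals.

```python
# Sketch of a script compatible with
# `python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 vanilla_kd.py`
# (toy models and data; a real script would hand these objects to the trainer).
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE/MASTER_* come from the launcher's env
torch.cuda.set_device(args.local_rank)

# Toy stand-ins for the real teacher and student networks.
teacher_model = nn.Linear(784, 10).cuda().eval()
student_model = DDP(nn.Linear(784, 10).cuda(), device_ids=[args.local_rank])

dataset = TensorDataset(torch.randn(512, 784), torch.randint(0, 10, (512,)))
sampler = DistributedSampler(dataset)            # gives each rank a disjoint shard of the data
train_loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```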

thanks again!

srikar2097 avatar Apr 17 '21 21:04 srikar2097

Thank you @srikar2097. We are glad that this library could be useful to you. We are working on the distributed training enhancement and hope to release it by mid-May.

Thank you for your patience.

avishreekh avatar Apr 20 '21 09:04 avishreekh

There are certain design choices that we are currently debating. We will add this feature once we decide how to efficiently accommodate it in the existing framework. Thanks!

avishreekh avatar May 30 '21 14:05 avishreekh