
Pytorch Lightning as a back-end

Open · williamFalcon opened this issue 6 years ago · 5 comments

Hi Team, I spoke with @thomwolf about possibly using Lightning as your backend! This would remove the need to maintain your own distributed-training and 16-bit precision code.

Check out the simple interface we have!

https://github.com/williamFalcon/pytorch-lightning
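
For context, here is a minimal sketch of what a Lightning-style training loop looks like. The module, data, and Trainer flags are illustrative and follow a later pytorch-lightning API than the one available when this issue was opened (multi-GPU and 16-bit were enabled with different flag names back then):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Plain PyTorch loss; Lightning handles the backward/optimizer calls
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Toy dataset just to make the sketch runnable
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
    batch_size=8,
)

# Distributed training and 16-bit precision become Trainer flags
trainer = pl.Trainer(max_epochs=2, accelerator="gpu", devices=2, precision=16)
trainer.fit(LitModel(), train_loader)
```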

williamFalcon · Aug 06 '19 19:08

Hi @williamFalcon, sorry to ask a question out of context, but I am stuck on 2 different issues.

  1. How do I continue training from a checkpoint? Say, after training for 2 epochs, I tried to load the checkpoint and resume training, but was unable to do so.
  2. What is the exact CUDA, cuDNN, etc. configuration needed to import amp? (See the sketch below.)
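
A hypothetical sketch of the apex amp pattern the question refers to, assuming NVIDIA apex has been built against the same CUDA toolkit version as the installed PyTorch binary (the usual import failure is a CUDA version mismatch rather than a specific cuDNN version; the model and optimizer below are placeholders):

```python
import torch
from apex import amp  # NVIDIA apex; requires a CUDA-matched build

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# opt_level "O1" = mixed precision with dynamic loss scaling
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 10).cuda()
loss = model(x).sum()

# Scale the loss before backward so fp16 gradients do not underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```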

nikhiljaiswal · Aug 07 '19 05:08

To continue training, use load_from_metrics. But that won't reinstate the training cycle; that's only supported in the cluster case (see hoc_load / hoc_save).

We do have a simpler interface for it coming in the next few days. Check out #27.
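
For readers landing here later, a rough sketch of how checkpoint loading and resuming look in more recent pytorch-lightning releases; the thread itself refers to the older load_from_metrics API and #27, and the LitModel, train_loader, and checkpoint path below are placeholders from the earlier sketch:

```python
import pytorch_lightning as pl

# Weights only: rebuild the module from a saved checkpoint
model = LitModel.load_from_checkpoint("checkpoints/epoch=1.ckpt")

# Full resume (optimizer state, epoch counter, etc.) in later Lightning versions
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, train_loader, ckpt_path="checkpoints/epoch=1.ckpt")
```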

williamFalcon · Aug 07 '19 10:08

@williamFalcon, OK, I will wait for that update. Fine-tuning the model on a custom dataset takes around 8-9 hours on a single V100 GPU, so we need a way to save a checkpoint and resume training from it later.

nikhiljaiswal · Aug 07 '19 10:08

It'll be available on master in a few hours.

williamFalcon · Aug 07 '19 12:08

@williamFalcon Sounds cool! Could you create a pull request to integrate pytorch-lightning for multi-node multi-GPU fine-tuning? I want to move beyond multi-GPU single-node fine-tuning (which is what this repo currently seems to support) to full-scale distributed fine-tuning with several nodes.
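
A hedged sketch of what multi-node distributed data-parallel configuration looks like in later pytorch-lightning versions; node and GPU counts are illustrative, and a cluster launcher (e.g. SLURM) is assumed to set the per-node rank and world-size environment variables:

```python
import pytorch_lightning as pl

# 2 nodes x 4 GPUs each, DDP strategy, 16-bit precision
trainer = pl.Trainer(
    max_epochs=3,
    accelerator="gpu",
    devices=4,
    num_nodes=2,
    strategy="ddp",
    precision=16,
)
trainer.fit(LitModel(), train_loader)  # placeholders from the earlier sketch
```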

g-karthik · Oct 15 '19 06:10