transfer-learning-conv-ai
PyTorch Lightning as a back-end
Hi team, I spoke with @thomwolf about possibly using Lightning as your backend! This would remove the need to maintain your own distributed-training and 16-bit (mixed-precision) code.
Check out the simple interface we have!
https://github.com/williamFalcon/pytorch-lightning
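For illustration, a minimal sketch of the kind of interface meant here, assuming a recent pytorch-lightning release (LitModel is a placeholder for this repo's GPT-2 model wrapped as a LightningModule, and the GPU count is just an example):

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    """Placeholder LightningModule; in practice this would wrap the repo's GPT-2 model."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Multi-GPU DDP plus 16-bit mixed precision from Trainer flags alone;
# no hand-written DistributedDataParallel or apex/amp code in the model.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", precision=16)
# trainer.fit(LitModel(), train_dataloaders=...)  # dataloader omitted in this sketch
```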
Hi @williamFalcon, sorry to ask a question outside this context, but I am stuck with two different issues:
- How do I continue training from a checkpoint? After training for 2 epochs, I tried to load the checkpoint and resume training, but was unable to do so.
- What exact CUDA and cuDNN configuration is needed to import amp?
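On the second point, a small diagnostic sketch (not part of this repo) that prints the CUDA and cuDNN versions PyTorch was built with and tries the apex amp import; NVIDIA apex is installed separately and its extensions must be compiled against the same CUDA version as PyTorch:

```python
import torch

# Versions PyTorch itself was built against; apex needs to match the CUDA build.
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU visible:", torch.cuda.is_available())

try:
    from apex import amp  # NVIDIA apex; installed separately from PyTorch
    print("apex.amp imported OK")
except ImportError as err:
    print("apex.amp not importable:", err)
```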
To continue training, use load_from_metrics, but that won't reinstate the training cycle; that is only supported in the cluster case (see hpc_load / hpc_save).
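For reference, a rough sketch of that pattern, assuming the early pytorch-lightning load_from_metrics API mentioned above (the model class name and file paths are placeholders, and the argument names may differ between Lightning versions):

```python
from train import LitTransformer  # placeholder: this repo's model wrapped as a LightningModule

# Reload the weights from an earlier run; as noted above, this restores the model
# but not the optimizer/epoch state of the interrupted training cycle.
model = LitTransformer.load_from_metrics(
    weights_path="checkpoints/_ckpt_epoch_2.ckpt",      # checkpoint written by the earlier run (assumed path)
    tags_csv="lightning_logs/version_0/meta_tags.csv",  # experiment tags file (assumed path)
)
```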
We do have a simpler interface for it coming in the next few days. Check out #27.
@williamFalcon, OK, I will wait for that update. Fine-tuning the model on a custom dataset takes around 8-9 hours on a single V100 GPU, so we need a way to save a checkpoint and resume training from it later.
It'll be available on master in a few hours.
@williamFalcon Sounds cool! Could you create a pull request to integrate pytorch-lightning for multi-node, multi-GPU fine-tuning? I want to move beyond the single-node, multi-GPU fine-tuning this repo currently seems to support to full-scale distributed fine-tuning across several nodes.
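For what it's worth, a sketch of what the multi-node configuration could look like with a recent pytorch-lightning Trainer (this thread predates the current API; LitModel is the placeholder from the earlier sketch, and the node/GPU counts are examples):

```python
import pytorch_lightning as pl

# 4 nodes x 8 GPUs under DDP with 16-bit precision; Lightning sets up the
# process groups and gradient synchronization from these flags. Each node
# launches the same script (e.g. via SLURM), and Lightning reads the node
# rank from the environment.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=4,
    strategy="ddp",
    precision=16,
)
# trainer.fit(LitModel(), train_dataloaders=...)  # model/dataloader as in the single-node case
```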