project-based-learning

Add a tutorial for distributed training

Open · Sze-qq opened this issue 2 years ago · 0 comments

Description

I'd like to share a tutorial that teaches, step by step, how to train AI models in a distributed manner.

Motivation and Context

As deep learning models grow larger, training them in a non-distributed manner (for example, on a single GPU) becomes difficult. This tutorial clearly explains:

  • What distributed training is
  • How to use efficient parallelization techniques to run and speed up AI model training
  • How to define my own parallel model, through a set of advanced tutorials

It is interesting and helpful for deep learning researchers.

How Has This Been Tested?

There are many popular parallelization techniques for distributed training, but I lacked a general overview of them. This tutorial gave me a clear view of these advanced techniques, and after reading it I was able to apply all of them to my own code. I feel I can write distributed deep learning models just as I write models on my laptop. Most importantly, it can greatly reduce my training time. I therefore recommend it to anyone who needs to train deep learning models.

Types of changes

  • [ ] Content Update (change which fixes an issue or updates an already existing submission)
  • [x] New Article (change which adds functionality)
  • [ ] Documentation change

Checklist:

  • [x] My code follows the code style of this project.
  • [x] I have updated the documentation accordingly.
  • [x] I have read the CONTRIBUTING document.
  • [x] I have checked that URLs and other resources are valid

Sze-qq, May 23 '22 09:05