
Implement distributed training

Open galacticglum opened this issue 4 years ago • 0 comments

This is pretty important since training the Transformer-decoder model is very memory-intensive. Currently, due to the lack of distributed training support, training the model requires either reducing the hyperparameters (such as layer count or window size), training on a GPU with a lot of memory, or some compromise between the two.

The problem is that it is much more difficult to scale up GPU memory capacity than it is to add more GPUs to a compute instance. For example, on Google Cloud Platform, the maximum GPU memory offered per single unit is 16 GB, which is not enough to train a large variant of the model (i.e. 8 decoder layers with a 768-dimensional embedding space and a 2048 window size; see the default configuration values for more information). Instead, we must add more GPUs to the instance to increase total memory capacity; however, this is pointless without distributed training.
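For reference, single-host multi-GPU data parallelism is the lightest-weight way to make those extra GPUs useful. The sketch below shows the general shape using TensorFlow's `tf.distribute.MirroredStrategy`; the framework choice and the `build_transformer_decoder` placeholder are assumptions for illustration only, not this project's actual code.

```python
# Minimal sketch of single-host, multi-GPU data-parallel training with
# tf.distribute.MirroredStrategy. build_transformer_decoder is a hypothetical
# stand-in for the real model constructor in this repository.
import tensorflow as tf

def build_transformer_decoder(num_layers=8, embedding_dim=768,
                              window_size=2048, vocab_size=10000):
    # Placeholder model: an embedding followed by a stack of dense layers,
    # standing in for the actual Transformer-decoder architecture.
    inputs = tf.keras.Input(shape=(window_size,), dtype=tf.int32)
    x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(inputs)
    for _ in range(num_layers):
        x = tf.keras.layers.Dense(embedding_dim, activation="relu")(x)
    outputs = tf.keras.layers.Dense(vocab_size)(x)
    return tf.keras.Model(inputs, outputs)

# Replicates the model across all GPUs visible on this instance.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on every GPU;
    # gradients are all-reduced across replicas at each training step.
    model = build_transformer_decoder()
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# dataset = load_dataset(...)    # hypothetical data pipeline
# model.fit(dataset, epochs=10)  # the global batch is split across the GPUs
```

Note that plain data parallelism only helps when a single replica of the model still fits on one GPU; fitting a larger model than 16 GB allows would additionally require some form of model parallelism or gradient checkpointing.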

galacticglum avatar May 11 '20 02:05 galacticglum