Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
I don't know whether this is intended to work or not, but I found that the following program:

```python
from megatron.data.indexed_dataset import IndexedDatasetBuilder, best_fitting_dtype

best_dtype = best_fitting_dtype(10_000)
IndexedDatasetBuilder("testfile", dtype=best_dtype)
```

leads...
**Motivation**. As @sashavor suggested, the carbon footprint working group needs an experiment tracker to properly track all the runs being done. An experiment tracker could also be more broadly interesting to...
The mC4 data is too large: for the 13 selected languages it's around 18 TB of data. I excluded the English data since teven already processed it. Arabic, Swahili (Bantu), Chinese, Catalan, English,...
In this issue, we discuss how viable/interesting it might be to implement a DeBERTa-like attention mechanism: https://arxiv.org/abs/2006.03654. Things to take into account:
- performance enhancements: check with the HF pretrained model...
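For orientation, here is a minimal single-head sketch of DeBERTa's disentangled attention (content-to-content, content-to-position, and position-to-content terms) as described in the paper. All names (`DisentangledAttention`, `max_rel_pos`, etc.) are illustrative assumptions, not the Megatron-DeepSpeed or HF implementation:

```python
# Simplified, single-head sketch of DeBERTa-style disentangled attention
# (https://arxiv.org/abs/2006.03654). Illustrative only.
import math
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    def __init__(self, hidden, max_rel_pos=128):
        super().__init__()
        self.max_rel_pos = max_rel_pos
        self.q_c = nn.Linear(hidden, hidden)  # content query
        self.k_c = nn.Linear(hidden, hidden)  # content key
        self.v = nn.Linear(hidden, hidden)    # value
        self.q_r = nn.Linear(hidden, hidden)  # position query (position->content)
        self.k_r = nn.Linear(hidden, hidden)  # position key (content->position)
        # relative position embeddings; shared across layers in the paper
        self.rel_emb = nn.Embedding(2 * max_rel_pos, hidden)

    def forward(self, x):  # x: [batch, seq, hidden]
        b, n, d = x.shape
        qc, kc, v = self.q_c(x), self.k_c(x), self.v(x)
        # clipped relative distance delta(i, j) = i - j, shifted into [0, 2K)
        pos = torch.arange(n, device=x.device)
        rel = (pos[:, None] - pos[None, :]).clamp(
            -self.max_rel_pos, self.max_rel_pos - 1) + self.max_rel_pos
        kr = self.k_r(self.rel_emb.weight)  # [2K, d] position keys
        qr = self.q_r(self.rel_emb.weight)  # [2K, d] position queries

        c2c = qc @ kc.transpose(-1, -2)  # content->content, [b, n, n]
        # content->position: A[i, j] += qc_i . kr_{delta(i, j)}
        c2p = torch.gather(qc @ kr.t(), 2, rel.expand(b, -1, -1))
        # position->content: A[i, j] += kc_j . qr_{delta(j, i)}
        p2c = torch.gather(kc @ qr.t(), 2, rel.expand(b, -1, -1)).transpose(1, 2)

        scores = (c2c + c2p + p2c) / math.sqrt(3 * d)  # 1/sqrt(3d) scaling
        return scores.softmax(dim=-1) @ v
```

The 1/sqrt(3d) scaling follows the paper's argument that three score terms are summed; a real implementation would also need multi-head splitting and the fused/TP-aware layouts used in this repo.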
After having a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss possible ready-to-use...
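One common family of mitigations is to detect a spike against a recent loss average and skip the offending update (or roll back to the last checkpoint). A minimal sketch, where `SpikeGuard`, the window, and the threshold factor are illustrative assumptions rather than anything in this repo:

```python
# Hypothetical loss-spike guard: compare each new loss to a running average
# and flag updates that exceed factor * average. Illustrative only.
from collections import deque

class SpikeGuard:
    def __init__(self, window=100, factor=2.0):
        self.history = deque(maxlen=window)
        self.factor = factor

    def should_skip(self, loss):
        """Return True if `loss` spikes above factor * recent average."""
        if len(self.history) == self.history.maxlen:
            avg = sum(self.history) / len(self.history)
            if loss > self.factor * avg:
                # don't append: keep the spike out of the running average
                return True
        self.history.append(loss)
        return False
```

In a training loop one would call `optimizer.step()` only when `guard.should_skip(loss.item())` is false; rollback-based variants reload the last checkpoint instead of skipping.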
Follow appendix A.1 of https://arxiv.org/pdf/1812.06162.pdf to implement monitoring of the gradient noise scale and add it to the TensorBoard log.
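For reference, appendix A.1 estimates the simple noise scale B_simple = tr(Σ)/|G|² from gradient norms measured at two batch sizes; in data-parallel training these can conveniently be the per-worker gradient and the all-reduced gradient. A minimal sketch of the estimator (the function name and EMA note are our assumptions; the formulas are the paper's):

```python
# Simple gradient noise scale estimator, following appendix A.1 of
# https://arxiv.org/pdf/1812.06162.pdf. Illustrative sketch.
def noise_scale(g_small_sq, g_big_sq, b_small, b_big):
    """Estimate B_simple = S / |G|^2 from two gradient-norm measurements.

    g_small_sq: squared gradient norm from a batch of size b_small
    g_big_sq:   squared gradient norm from a batch of size b_big
    """
    # unbiased estimate of the true squared gradient norm |G|^2
    g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    # unbiased estimate of tr(Sigma)
    s = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return s / g_sq
```

The paper recommends smoothing S and |G|² with separate exponential moving averages before taking the ratio, since each per-step estimate is noisy; the smoothed ratio is what would go to TensorBoard.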
1. A recent commit removed `tools/convert_checkpoint/deepspeed_checkpoint.py`, but there is still an attempt to import it in `tools/convert_checkpoint/deepspeed_to_megatron.py`. The other scripts in the folder appear to be fine. I guess the...
A tentative attempt at applying teacher-student training using Megatron-DeepSpeed. WIP draft PR, not meant to be merged. cc @thomasw21
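For context, the usual teacher-student (knowledge distillation) objective blends a temperature-softened KL term against the teacher's logits with the ordinary hard-label loss. A generic sketch, not the code in this draft PR; the `temperature` and `alpha` values are illustrative:

```python
# Generic knowledge-distillation loss sketch. Illustrative only.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients match the hard-loss scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The temperature² factor keeps the soft-target gradients on the same scale as the hard loss when the temperature changes, a standard choice in distillation setups.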
Patch ROCm support into the fused kernels, ported from Microsoft/Megatron-DeepSpeed-fork.
I want to ask why we cannot make the tp and pp of a checkpoint bigger? For example, making tp=4 when its original tp is 2. I tried to do...
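Conceptually, growing tensor parallelism means splitting each existing shard along its partition axis. A minimal sketch under the assumption of a plain column-parallel weight; everything here, including `split_tp_partitions`, is illustrative and not a supported conversion path:

```python
# Sketch: growing tp=2 -> tp=4 by halving each shard along its partition
# axis (column-parallel weights split along dim 0, row-parallel along dim 1,
# as an assumption). Illustrative only.
import torch

def split_tp_partitions(partitions, factor=2, dim=0):
    """Split each existing TP shard into `factor` shards along `dim`."""
    new_partitions = []
    for p in partitions:
        new_partitions.extend(torch.chunk(p, factor, dim=dim))
    return new_partitions

# e.g. two tp=2 shards of a column-parallel weight -> four tp=4 shards
shards_tp2 = [torch.randn(2048, 1024), torch.randn(2048, 1024)]
shards_tp4 = split_tp_partitions(shards_tp2, factor=2, dim=0)
assert len(shards_tp4) == 4 and shards_tp4[0].shape == (1024, 1024)
```

The practical difficulty is that fused QKV layouts, vocab-parallel embeddings, and partitioned optimizer state all have to be re-sharded consistently at the same time, which makes a general tp/pp-growing conversion nontrivial.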