Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Results 124 Megatron-DeepSpeed issues

I don't know whether this is intended to work or not, but I found the following program:
```
from megatron.data.indexed_dataset import IndexedDatasetBuilder, best_fitting_dtype
best_dtype = best_fitting_dtype(10_000)
IndexedDatasetBuilder("testfile", dtype=best_dtype)
```
leads...

**Motivation**. As @sashavor suggested, the carbon footprint working group needs an experiment tracker to properly follow all runs being done. An experiment tracker could also be more broadly interesting to...

🌍 Carbon

The mC4 data is too large: for the 13 selected languages it is around 18 TB. I excluded the English data since teven already processed it. Arabic, Swahili (Bantu), Chinese, Catalan, English,...

In this issue, we discuss how viable/interesting it might be to implement a DeBERTa-like attention mechanism: https://arxiv.org/abs/2006.03654 Things to take into account:
- performance enhancements: Check with HF pretrained model...

enhancement
arch&scale

After a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss possible ready-to-use...

Follow Appendix A.1 of https://arxiv.org/pdf/1812.06162.pdf to implement monitoring of the gradient noise scale and add it to the TensorBoard log.
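
As a rough illustration of what that appendix describes, here is a minimal sketch of the two-batch-size estimator for the simple noise scale. It assumes you can obtain the squared gradient norm at a small and a large batch size in the same step. The names `noise_scale_estimates` and `NoiseScaleTracker`, and the smoothing factor `beta`, are hypothetical and not part of Megatron-DeepSpeed.

```python
def noise_scale_estimates(sq_norm_small, sq_norm_big, b_small, b_big):
    """Unbiased estimates of |G|^2 and tr(Sigma) from two batch sizes.

    Relies on E[|G_B|^2] = |G|^2 + tr(Sigma) / B, evaluated at
    B = b_small and B = b_big, and solves for the two unknowns.
    """
    if b_big <= b_small:
        raise ValueError("b_big must be larger than b_small")
    g2 = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)
    s = (sq_norm_small - sq_norm_big) / (1.0 / b_small - 1.0 / b_big)
    return g2, s


class NoiseScaleTracker:
    """Hypothetical helper: smooths both estimates, then takes their ratio."""

    def __init__(self, beta=0.99):
        self.beta = beta      # EMA decay; 0.99 is an arbitrary assumption
        self.g2_ema = None
        self.s_ema = None

    def update(self, sq_norm_small, sq_norm_big, b_small, b_big):
        g2, s = noise_scale_estimates(sq_norm_small, sq_norm_big, b_small, b_big)
        if self.g2_ema is None:
            self.g2_ema, self.s_ema = g2, s
        else:
            self.g2_ema = self.beta * self.g2_ema + (1 - self.beta) * g2
            self.s_ema = self.beta * self.s_ema + (1 - self.beta) * s
        # The single-step estimates are noisy, so the ratio is taken on the
        # smoothed values rather than per step.
        if self.g2_ema <= 0:
            return float("nan")
        return self.s_ema / self.g2_ema
```

In a data-parallel run, one plausible choice is to use the per-replica gradient norm before the all-reduce as the small-batch value and the norm of the averaged gradient as the large-batch value. The value returned by `update` could then be written to the log with something like `SummaryWriter.add_scalar`.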

arch&scale

1. A recent commit removed `tools/convert_checkpoint/deepspeed_checkpoint.py` but there is still an attempt to import it in `tools/convert_checkpoint/deepspeed_to_megatron.py`. The other scripts in the folder appear to be ok. I guess the...

Attempt at applying a teacher-student approach using Megatron-DeepSpeed. WIP draft PR, not meant to be merged. cc @thomasw21

Patch ROCm support for fused kernels from the Microsoft/Megatron-DeepSpeed fork.

I want to ask why we cannot make the tp and pp of a checkpoint bigger. For example, make tp=4 when its original tp is 2. I tried to do...