Megatron-LM
Ongoing research training transformer models at scale
**question** @jon-barker Hello Jon, I have some questions about the embedding; can you help explain? Why replace F.embedding(masked_input, self.weight) with self.weight[masked_input] in the forward() function of the VocabParallelEmbedding class? What is the...
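For context, the two calls in question compute the same lookup for in-range indices; a minimal sketch (the shapes and values below are made up purely for illustration, this is not Megatron's forward pass):

```python
import torch
import torch.nn.functional as F

# Illustration only: for a plain table lookup, advanced indexing into the
# weight matrix and F.embedding return identical results.
vocab_size, hidden = 16, 4
weight = torch.randn(vocab_size, hidden)
masked_input = torch.tensor([[1, 3, 0], [2, 2, 5]])

out_index = weight[masked_input]               # advanced indexing
out_embed = F.embedding(masked_input, weight)  # functional embedding lookup

assert torch.equal(out_index, out_embed)
```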
**Your question** I've noticed that P2P overlap has been implemented for the interleaved pipeline scheduler; is it theoretically possible to port it to the 1F1B pipeline scheduler?
**Describe the bug** I run the NeMo code and the job gets stuck with pipeline_model_parallel_size > 1 on 8 GPUs. I am running the job with mainline NeMo and Megatron-LM...
**Incorrect Dataset Shuffling** - Currently, in `gpt_dataset.py`, the dataset is being globally shuffled across epochs rather than shuffled within each epoch, which is the standard. - Both the shuffle index [code](https://github.com/NVIDIA/Megatron-LM/blob/5f9c870f9f24b482509699d206a9dbb00958f6fc/megatron/core/datasets/gpt_dataset.py#L565) and...
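To make the distinction concrete, here is a minimal sketch of the two strategies being contrasted (an illustration under simplified assumptions, not the `gpt_dataset.py` implementation):

```python
import numpy as np

# Illustration of the two shuffling strategies discussed in the issue.
num_samples, num_epochs, seed = 8, 3, 1234
rng = np.random.RandomState(seed)

# Global shuffle: one permutation over all (sample, epoch) pairs, so samples
# from different epochs can interleave arbitrarily.
global_shuffle = rng.permutation(num_samples * num_epochs)

# Within-epoch shuffle: permute each epoch independently and concatenate, so
# every sample is seen once per epoch before any sample repeats.
within_epoch = np.concatenate(
    [rng.permutation(num_samples) + epoch * num_samples for epoch in range(num_epochs)]
)
```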
When I run tools/merge_mp_partitions.py, it fails with an exception:
```
Traceback (most recent call last):
  File "merge_mp_partitions.py", line 286, in <module>
    main()
  File "merge_mp_partitions.py", line 212, in main
    merged_model = get_model(model_type)...
```
Which script should we pass the parameters defined here to in order to launch the model? https://github.com/NVIDIA/Megatron-LM/blob/main/docs/llama2.md#launch-megatron
**Your question** Hello, as far as I know, Megatron only uses a padding mask in the BERT implementation. Yet in the Hugging Face Transformers library, the Llama model should also take in...
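If the question is about how a padding mask would combine with the causal mask in a decoder-only model such as Llama, here is a minimal sketch (an illustration only, not Megatron or Transformers code; the boolean convention "True means masked out" is an assumption):

```python
import torch

# Illustration: combine a causal mask with a padding mask for one sequence.
seq_len = 5
# Example padding mask: the last two tokens of the sequence are padding.
is_padding = torch.tensor([False, False, False, True, True])

# Causal mask: position i may not attend to positions j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Combined mask: a position is masked if it is in the future OR it is padding.
combined_mask = causal_mask | is_padding.unsqueeze(0)
```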
@deepakn94 Hi, I'm diving deep into Megatron-LM's implementation. For the DDP wrapper, the current implementation maps each parameter's `main_grad` into the grad buffer. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/grad_buffer.py#L272-L280 And then, in the backward hook, it adds `grad` to...
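For readers following along, here is a minimal sketch of the pattern being described (my reading of the excerpt, not the `grad_buffer.py` code; in particular, Megatron registers its hooks differently, and `param.register_hook` is used here only to keep the example self-contained):

```python
import torch

# Sketch: each parameter's .main_grad is a view into one flat grad buffer,
# and a backward hook accumulates the freshly computed grad into that view.
model = torch.nn.Linear(4, 4, bias=False)

# One flat buffer holding gradients for all parameters (here just one).
numel = sum(p.numel() for p in model.parameters())
grad_buffer = torch.zeros(numel)

offset = 0
for param in model.parameters():
    # main_grad is a view into the flat buffer, reshaped to the parameter's shape.
    param.main_grad = grad_buffer[offset: offset + param.numel()].view_as(param)
    offset += param.numel()

    def make_hook(p):
        def hook(grad):
            # Accumulate the incoming gradient into the shared buffer view.
            p.main_grad.add_(grad)
            return grad
        return hook

    param.register_hook(make_hook(param))

# Usage: a backward pass now also fills grad_buffer via the hooks.
loss = model(torch.randn(2, 4)).sum()
loss.backward()
```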
**Describe the bug** During our training sessions using Megatron's Mixture of Experts (MoE) layers, we observed a decline in performance at specific steps, with the deterioration manifesting sporadically and...