Megatron-LM
Ongoing research training transformer models at scale
**question** @jon-barker Hello Jon, I have some questions about the embedding; can you help explain? Why replace F.embedding(masked_input, self.weight) with self.weight[masked_input] in the forward() function of the VocabParallelEmbedding class? What is the...
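For context, the two calls in question compute the same lookup for in-range indices; a minimal sketch (the shapes and values below are made up purely for illustration, this is not Megatron's forward pass):

```python
import torch
import torch.nn.functional as F

# Illustration only: for a plain table lookup, advanced indexing into the
# weight matrix and F.embedding return identical results.
vocab_size, hidden = 16, 4
weight = torch.randn(vocab_size, hidden)
masked_input = torch.tensor([[1, 3, 0], [2, 2, 5]])

out_index = weight[masked_input]               # advanced indexing
out_embed = F.embedding(masked_input, weight)  # functional embedding lookup

assert torch.equal(out_index, out_embed)
```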
**Your question** I've noticed that P2P overlap has been implemented for the interleaved pipeline scheduler; is it theoretically possible to port it to the 1F1B pipeline scheduler?
**Describe the bug** I run the NeMo code and the job gets stuck with pipeline_model_parallel_size > 1 on 8 GPUs. I am running the job with mainline NeMo and Megatron-LM...
**Incorrect Dataset Shuffling** - Currently, in `gpt_dataset.py`, the dataset is being globally shuffled across epochs rather than shuffled within each epoch, which is the standard. - Both the shuffle index [code](https://github.com/NVIDIA/Megatron-LM/blob/5f9c870f9f24b482509699d206a9dbb00958f6fc/megatron/core/datasets/gpt_dataset.py#L565) and...
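To make the distinction concrete, here is a minimal sketch of the two strategies being contrasted (an illustration under simplified assumptions, not the `gpt_dataset.py` implementation):

```python
import numpy as np

# Illustration of the two shuffling strategies discussed in the issue.
num_samples, num_epochs, seed = 8, 3, 1234
rng = np.random.RandomState(seed)

# Global shuffle: one permutation over all (sample, epoch) pairs, so samples
# from different epochs can interleave arbitrarily.
global_shuffle = rng.permutation(num_samples * num_epochs)

# Within-epoch shuffle: permute each epoch independently and concatenate, so
# every sample is seen once per epoch before any sample repeats.
within_epoch = np.concatenate(
    [rng.permutation(num_samples) + epoch * num_samples for epoch in range(num_epochs)]
)
```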
When I run tools/merge_mp_partitions.py, it fails with an exception:
```
Traceback (most recent call last):
  File "merge_mp_partitions.py", line 286, in <module>
    main()
  File "merge_mp_partitions.py", line 212, in main
    merged_model = get_model(model_type)...
```
Which script should we pass the parameters defined here to in order to launch the model? https://github.com/NVIDIA/Megatron-LM/blob/main/docs/llama2.md#launch-megatron
**Your question** Hello, as far as I know, Megatron only uses a padding mask in the BERT implementation. Yet in the Hugging Face Transformers library, the Llama model should also take in...
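If the question is about how a padding mask would combine with the causal mask in a decoder-only model such as Llama, here is a minimal sketch (an illustration only, not Megatron or Transformers code; the boolean convention "True means masked out" is an assumption):

```python
import torch

# Illustration: combine a causal mask with a padding mask for one sequence.
seq_len = 5
# Example padding mask: the last two tokens of the sequence are padding.
is_padding = torch.tensor([False, False, False, True, True])

# Causal mask: position i may not attend to positions j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Combined mask: a position is masked if it is in the future OR it is padding.
combined_mask = causal_mask | is_padding.unsqueeze(0)
```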
@deepakn94 Hi, I'm diving deep into Megatron-LM's implementation. For the DDP wrapper, the current implementation maps each parameter's `main_grad` into the grad buffer. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/grad_buffer.py#L272-L280 And then, in the backward hook, it adds `grad` to...
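For readers following along, here is a minimal sketch of the pattern being described (my reading of the excerpt, not the `grad_buffer.py` code; in particular, Megatron registers its hooks differently, and `param.register_hook` is used here only to keep the example self-contained):

```python
import torch

# Sketch: each parameter's .main_grad is a view into one flat grad buffer,
# and a backward hook accumulates the freshly computed grad into that view.
model = torch.nn.Linear(4, 4, bias=False)

# One flat buffer holding gradients for all parameters (here just one).
numel = sum(p.numel() for p in model.parameters())
grad_buffer = torch.zeros(numel)

offset = 0
for param in model.parameters():
    # main_grad is a view into the flat buffer, reshaped to the parameter's shape.
    param.main_grad = grad_buffer[offset: offset + param.numel()].view_as(param)
    offset += param.numel()

    def make_hook(p):
        def hook(grad):
            # Accumulate the incoming gradient into the shared buffer view.
            p.main_grad.add_(grad)
            return grad
        return hook

    param.register_hook(make_hook(param))

# Usage: a backward pass now also fills grad_buffer via the hooks.
loss = model(torch.randn(2, 4)).sum()
loss.backward()
```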
**Describe the bug** During our training sessions using Megatron's Mixture of Experts (MoE) layers, we observed a decline in performance at specific steps, with the deterioration manifesting sporadically and...