Megatron-LM

Ongoing research training transformer models at scale

Results: 294 Megatron-LM issues, sorted by recently updated

Hi, after going through both Megatron-LM and NeMo, I've found that NeMo configs default to the [`MegatronDistributedFusedAdam`](https://github.com/NVIDIA/NeMo/blob/874a1eab03fa49e6a10e00ce9518cba699d7eb37/nemo/core/optim/distributed_adam.py#L95) optimizer from the NeMo framework, while Megatron also contains a [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/fd3c77115c912e67b831c590bdc4f5e08e42f166/megatron/core/optimizer/distrib_optimizer.py#L65). The...
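For context on this question: Megatron's `DistributedOptimizer` shards optimizer state across data-parallel ranks (a ZeRO-1-style scheme, enabled via the `--use-distributed-optimizer` flag in recent versions), which is conceptually close to what NeMo's `MegatronDistributedFusedAdam` does. A minimal sketch of the sharded-update idea, with illustrative names and a plain SGD update standing in for the fused Adam the real implementations use:

```python
# Minimal sketch of the optimizer-state sharding idea behind Megatron's
# DistributedOptimizer. Function and variable names are illustrative.
import torch
import torch.distributed as dist

def sharded_sgd_step(flat_params: torch.Tensor,
                     flat_grads: torch.Tensor,
                     lr: float,
                     group=None):
    world = dist.get_world_size(group)
    rank = dist.get_rank(group)
    shard = flat_params.numel() // world  # assume divisibility for the sketch

    # 1) Reduce-scatter gradients: each rank ends up with the summed
    #    gradients for its own shard only.
    grad_shard = torch.empty(shard, dtype=flat_grads.dtype,
                             device=flat_grads.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grads, group=group)

    # 2) Each rank applies the optimizer update only to its own shard.
    param_shard = flat_params.narrow(0, rank * shard, shard)
    param_shard.add_(grad_shard, alpha=-lr / world)  # average the summed grads

    # 3) All-gather the updated shards so every rank has full parameters.
    dist.all_gather_into_tensor(flat_params, param_shard, group=group)
```

The payoff of this layout is that each data-parallel rank stores only 1/DP of the optimizer state, at the cost of the extra all-gather in step 3.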

**Your question** Could you let me know which version I should revert to if I want to use the previous checkpoint storage...

Hi, I am training a Llama2-7b model with Megatron-LM on four H20 nodes, 32 GPUs in total. The parallel strategy is set to TP=8 / PP=2 / DP=2. Now, I want to know the data...
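As a quick sanity check on the layout described here (assuming four 8-GPU H20 nodes):

```python
# The three parallel degrees must multiply to the world size.
TP, PP, DP = 8, 2, 2
world_size = 4 * 8                  # four H20 nodes x 8 GPUs each
assert TP * PP * DP == world_size   # 8 * 2 * 2 == 32

# With DP=2 there are two model replicas, so each replica sees half of
# every global batch:
#   global_batch = micro_batch * grad_accum_steps * DP
```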

```
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string) + 0x99 (0x7fe76ab98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional) + 0x1e1 (0x7fe705ea04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fe705ea81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f...
```

stale

**Describe the bug** I get an AttributeError when trying to convert a Llama3-8B model from HF format to mcore format. The error is: `AttributeError: 'Tokenizer' object has no attribute 'vocab_size'` To...

stale
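Regarding the `vocab_size` error above: the low-level `tokenizers.Tokenizer` class exposes `get_vocab_size()` rather than a `vocab_size` attribute, which would produce exactly this `AttributeError`. A hedged workaround sketch, assuming that is the object the converter was handed (the shim class below is illustrative, not part of either library):

```python
# Hypothetical shim for code that expects a .vocab_size attribute but was
# given a low-level tokenizers.Tokenizer, which only has get_vocab_size().
from tokenizers import Tokenizer

class VocabSizeShim:
    """Wrap a tokenizers.Tokenizer so code expecting .vocab_size works."""

    def __init__(self, tokenizer: Tokenizer):
        self._tok = tokenizer

    @property
    def vocab_size(self) -> int:
        return self._tok.get_vocab_size()

    def __getattr__(self, name):
        # Delegate everything else to the wrapped tokenizer.
        return getattr(self._tok, name)
```

Loading the tokenizer through `transformers.AutoTokenizer`, which does expose `vocab_size`, may also sidestep the error.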

**Your question** I want to ingest a checkpoint from HF into Megatron-LM and then continue training on that. For the latter part (training) I will need TP > 1...
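For the HF-to-Megatron direction, recent Megatron-LM trees ship a checkpoint converter under `tools/checkpoint/`. A sketch of invoking it to produce a TP=2 checkpoint follows; the script path, loader/saver names, and flags are from memory of recent checkouts and should be verified against the version in use:

```python
# Hedged sketch: invoking Megatron-LM's checkpoint converter to re-shard an
# HF Llama checkpoint for TP > 1. Flag names are assumptions; check
# tools/checkpoint/ in your tree.
import subprocess

subprocess.run([
    "python", "tools/checkpoint/convert.py",
    "--model-type", "GPT",
    "--loader", "llama_mistral",            # HF-side loader (assumption)
    "--saver", "mcore",                     # Megatron-Core saver (assumption)
    "--load-dir", "/path/to/hf_checkpoint",
    "--save-dir", "/path/to/megatron_checkpoint",
    "--target-tensor-parallel-size", "2",
    "--target-pipeline-parallel-size", "1",
], check=True)
```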

**Describe the bug** When calculating `w1` in this part, using `view` will cause element confusion, as shown in the figure below. ![image](https://github.com/NVIDIA/Megatron-LM/assets/39549453/de68effb-5c77-498e-a656-ec99a45ca5b3) As shown in the figure above, it is...

stale
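A small self-contained illustration of the kind of element confusion `view` can cause when a transpose or permutation is actually intended (generic PyTorch behavior, not the exact code from the issue above):

```python
# .view() reinterprets the existing memory layout in row-major order, so
# using it where a transpose is needed silently reorders elements.
import torch

x = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Wrong: view "reshapes" in memory order and does NOT transpose.
wrong = x.view(3, 2)                # [[0, 1], [2, 3], [4, 5]]

# Right: transpose swaps axes; elements keep their logical positions.
right = x.transpose(0, 1)           # [[0, 3], [1, 4], [2, 5]]

assert not torch.equal(wrong, right)
```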

https://github.com/googlecodelabs/tools/issues/903

stale

**Is your feature request related to a problem? Please describe.** I have seen support for training MoE models in Megatron, including scripts for the Mixtral 8x7B model, at https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/moe.html. However,...

stale
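For reference, the Megatron-Core MoE path linked above is driven by a handful of config fields. A hedged sketch follows; the field names match recent `megatron.core` releases but may differ by version, and the sizes here are illustrative:

```python
# Hedged sketch of Megatron-Core MoE configuration. Field names and values
# are assumptions based on recent megatron.core releases.
from megatron.core.transformer import TransformerConfig

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_moe_experts=8,        # Mixtral-8x7B-style expert count (illustrative)
    moe_router_topk=2,        # route each token to its top-2 experts
)
```

On the training-script side, the corresponding knobs are typically exposed as `--num-experts`, `--moe-router-topk`, and `--expert-model-parallel-size`.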

**Describe the bug** The following function (in `megatron.core.tensor_parallel.random`) is called when we initialize the random seeds. I suspect the behavior of this function doesn't match its docstring, even...

stale
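For background on the issue above: `megatron.core.tensor_parallel.random` keeps named CUDA RNG states and temporarily forks into one of them, so that, for example, dropout inside tensor-parallel regions can use a per-rank seed while replicated operations stay on the shared default state. A simplified sketch of that pattern, not Megatron's actual class:

```python
# Simplified sketch of a named CUDA RNG-state tracker (illustrative only).
import contextlib
import torch

class RNGStatesTracker:
    def __init__(self):
        self.states = {}

    def add(self, name: str, seed: int):
        # Record the CUDA RNG state produced by this seed, then restore
        # whatever state was active before.
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def fork(self, name: str):
        # Swap in the named state, run the body, then save the advanced
        # state and restore the default one.
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states[name])
        try:
            yield
        finally:
            self.states[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)
```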