Megatron-LM

Ongoing research training transformer models at scale

Results: 294 Megatron-LM issues, sorted by recently updated

Hi, after going through both Megatron-LM and NeMo, I've found that NeMo configs default to the [`MegatronDistributedFusedAdam`](https://github.com/NVIDIA/NeMo/blob/874a1eab03fa49e6a10e00ce9518cba699d7eb37/nemo/core/optim/distributed_adam.py#L95) optimizer from the NeMo framework, while Megatron also contains a [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/fd3c77115c912e67b831c590bdc4f5e08e42f166/megatron/core/optimizer/distrib_optimizer.py#L65). The...
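For context on this question: Megatron's `DistributedOptimizer` shards optimizer state across data-parallel ranks (a ZeRO-1-style scheme, enabled via the `--use-distributed-optimizer` flag in recent versions), which is conceptually close to what NeMo's `MegatronDistributedFusedAdam` does. A minimal sketch of the sharded-update idea, with illustrative names and a plain SGD update standing in for the fused Adam the real implementations use:

```python
# Minimal sketch of the optimizer-state sharding idea behind Megatron's
# DistributedOptimizer. Function and variable names are illustrative.
import torch
import torch.distributed as dist

def sharded_sgd_step(flat_params: torch.Tensor,
                     flat_grads: torch.Tensor,
                     lr: float,
                     group=None):
    world = dist.get_world_size(group)
    rank = dist.get_rank(group)
    shard = flat_params.numel() // world  # assume divisibility for the sketch

    # 1) Reduce-scatter gradients: each rank ends up with the summed
    #    gradients for its own shard only.
    grad_shard = torch.empty(shard, dtype=flat_grads.dtype,
                             device=flat_grads.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grads, group=group)

    # 2) Each rank applies the optimizer update only to its own shard.
    param_shard = flat_params.narrow(0, rank * shard, shard)
    param_shard.add_(grad_shard, alpha=-lr / world)  # average the summed grads

    # 3) All-gather the updated shards so every rank has full parameters.
    dist.all_gather_into_tensor(flat_params, param_shard, group=group)
```

The payoff of this layout is that each data-parallel rank stores only 1/DP of the optimizer state, at the cost of the extra all-gather in step 3.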

**Your question** Could you let me know which version I should revert to if I want to use the previous checkpoint storage...

Hi, I am training a Llama2-7b model with Megatron-LM on four H20 nodes, 32 GPUs in total. The parallel strategy is set to TP=8 / PP=2 / DP=2. Now, I want to know the data...
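As a quick sanity check on the layout described here (assuming four 8-GPU H20 nodes):

```python
# The three parallel degrees must multiply to the world size.
TP, PP, DP = 8, 2, 2
world_size = 4 * 8                  # four H20 nodes x 8 GPUs each
assert TP * PP * DP == world_size   # 8 * 2 * 2 == 32

# With DP=2 there are two model replicas, so each replica sees half of
# every global batch:
#   global_batch = micro_batch * grad_accum_steps * DP
```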

```
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string) + 0x99 (0x7fe76ab98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional) + 0x1e1 (0x7fe705ea04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fe705ea81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f...
```

stale

**Describe the bug** I get an AttributeError when trying to convert a Llama3-8B model from HF format to mcore format. The error is: `AttributeError: 'Tokenizer' object has no attribute 'vocab_size'` To...

stale
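Regarding the `vocab_size` error above: the low-level `tokenizers.Tokenizer` class exposes `get_vocab_size()` rather than a `vocab_size` attribute, which would produce exactly this `AttributeError`. A hedged workaround sketch, assuming that is the object the converter was handed (the shim class below is illustrative, not part of either library):

```python
# Hypothetical shim for code that expects a .vocab_size attribute but was
# given a low-level tokenizers.Tokenizer, which only has get_vocab_size().
from tokenizers import Tokenizer

class VocabSizeShim:
    """Wrap a tokenizers.Tokenizer so code expecting .vocab_size works."""

    def __init__(self, tokenizer: Tokenizer):
        self._tok = tokenizer

    @property
    def vocab_size(self) -> int:
        return self._tok.get_vocab_size()

    def __getattr__(self, name):
        # Delegate everything else to the wrapped tokenizer.
        return getattr(self._tok, name)
```

Loading the tokenizer through `transformers.AutoTokenizer`, which does expose `vocab_size`, may also sidestep the error.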

**Your question** I want to ingest a checkpoint from HF into Megatron-LM and then continue training on that. For the latter part (training) I will need TP > 1...
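For the HF-to-Megatron direction, recent Megatron-LM trees ship a checkpoint converter under `tools/checkpoint/`. A sketch of invoking it to produce a TP=2 checkpoint follows; the script path, loader/saver names, and flags are from memory of recent checkouts and should be verified against the version in use:

```python
# Hedged sketch: invoking Megatron-LM's checkpoint converter to re-shard an
# HF Llama checkpoint for TP > 1. Flag names are assumptions; check
# tools/checkpoint/ in your tree.
import subprocess

subprocess.run([
    "python", "tools/checkpoint/convert.py",
    "--model-type", "GPT",
    "--loader", "llama_mistral",            # HF-side loader (assumption)
    "--saver", "mcore",                     # Megatron-Core saver (assumption)
    "--load-dir", "/path/to/hf_checkpoint",
    "--save-dir", "/path/to/megatron_checkpoint",
    "--target-tensor-parallel-size", "2",
    "--target-pipeline-parallel-size", "1",
], check=True)
```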

**Describe the bug** When calculating `w1` in this part, using `view` will cause element confusion, as shown in the figure below. ![image](https://github.com/NVIDIA/Megatron-LM/assets/39549453/de68effb-5c77-498e-a656-ec99a45ca5b3) As shown in the figure above, it is...

stale
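A small self-contained illustration of the kind of element confusion `view` can cause when a transpose or permutation is actually intended (generic PyTorch behavior, not the exact code from the issue above):

```python
# .view() reinterprets the existing memory layout in row-major order, so
# using it where a transpose is needed silently reorders elements.
import torch

x = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Wrong: view "reshapes" in memory order and does NOT transpose.
wrong = x.view(3, 2)                # [[0, 1], [2, 3], [4, 5]]

# Right: transpose swaps axes; elements keep their logical positions.
right = x.transpose(0, 1)           # [[0, 3], [1, 4], [2, 5]]

assert not torch.equal(wrong, right)
```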

https://github.com/googlecodelabs/tools/issues/903

stale

**Is your feature request related to a problem? Please describe.** I have seen support for training MoE models in Megatron, including scripts for the Mixtral 8x7B model, at https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/moe.html. However,...

stale
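For reference, the Megatron-Core MoE path linked above is driven by a handful of config fields. A hedged sketch follows; the field names match recent `megatron.core` releases but may differ by version, and the sizes here are illustrative:

```python
# Hedged sketch of Megatron-Core MoE configuration. Field names and values
# are assumptions based on recent megatron.core releases.
from megatron.core.transformer import TransformerConfig

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_moe_experts=8,        # Mixtral-8x7B-style expert count (illustrative)
    moe_router_topk=2,        # route each token to its top-2 experts
)
```

On the training-script side, the corresponding knobs are typically exposed as `--num-experts`, `--moe-router-topk`, and `--expert-model-parallel-size`.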

**Describe the bug** The following function (in `megatron.core.tensor_parallel.random`) is called when we initialize the random seeds. I suspect the behavior of this function doesn't match its docstring, even...

stale
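For background on the issue above: `megatron.core.tensor_parallel.random` keeps named CUDA RNG states and temporarily forks into one of them, so that, for example, dropout inside tensor-parallel regions can use a per-rank seed while replicated operations stay on the shared default state. A simplified sketch of that pattern, not Megatron's actual class:

```python
# Simplified sketch of a named CUDA RNG-state tracker (illustrative only).
import contextlib
import torch

class RNGStatesTracker:
    def __init__(self):
        self.states = {}

    def add(self, name: str, seed: int):
        # Record the CUDA RNG state produced by this seed, then restore
        # whatever state was active before.
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def fork(self, name: str):
        # Swap in the named state, run the body, then save the advanced
        # state and restore the default one.
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states[name])
        try:
            yield
        finally:
            self.states[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)
```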