Megatron-LM
Ongoing research training transformer models at scale
Fix the bug where the optimizer never actually uses multi_tensor_applier under float16, because overflow_buf always evaluates to False. Specifically, `overflow_buf = self._dummy_overflow_buf`, and `self._dummy_overflow_buf` is initialized as `torch.tensor([0], dtype=torch.int, device='cuda')` under...
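As a quick illustration of the reported behavior, here is a minimal, hypothetical reduction of the pattern (the real Megatron-LM optimizer code differs in detail): a one-element int tensor initialized to 0 is falsy in Python, so a plain truthiness check on the dummy buffer silently routes every call to the slow per-tensor fallback.

```python
import torch

# Simplified sketch of the pattern described in the report.
dummy_overflow_buf = torch.tensor([0], dtype=torch.int, device='cuda')

def copy_this_to_that(this, that, overflow_buf=None):
    # BUG: bool(torch.tensor([0])) is False, so when the dummy buffer is passed
    # this branch is skipped and the fused multi_tensor_applier path is never taken.
    if overflow_buf:
        overflow_buf.fill_(0)
        # multi_tensor_applier(amp_C.multi_tensor_scale, overflow_buf, [this, that], 1.0)
        pass
    else:
        # Slow fallback: one copy kernel per tensor.
        for src, dst in zip(this, that):
            dst.copy_(src)

# A fix consistent with the description is to test for presence rather than truthiness:
#     if overflow_buf is not None:
```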
[BUG]
**Describe the bug** A clear and concise description of what the bug is. **To Reproduce** Steps to reproduce the behavior. The easier it is to reproduce the faster it will...
**Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] **Describe the solution you'd...
Fixed the bug that prevents configuring datasets using train-data-path, valid-data-path, and test-data-path. When the --split parameter is not configured, it is set to the default value 969,...
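A hedged sketch of the argument handling this fix implies (the flag names match Megatron-LM's CLI, but the exact default ratio and the condition are assumptions, not copied from the patch): apply a default `--split` only when a single blended `--data-path` is used, and leave it unset when per-split paths are given.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-path', nargs='*', default=None)
parser.add_argument('--train-data-path', nargs='*', default=None)
parser.add_argument('--valid-data-path', nargs='*', default=None)
parser.add_argument('--test-data-path', nargs='*', default=None)
parser.add_argument('--split', type=str, default=None)
args = parser.parse_args()

# Per-split paths and a split ratio are mutually exclusive ways of building datasets.
uses_per_split_paths = any(
    p is not None for p in (args.train_data_path, args.valid_data_path, args.test_data_path)
)
if args.split is None and not uses_per_split_paths:
    # Assumed default train/valid/test ratio; only meaningful for a blended --data-path.
    args.split = '969, 30, 1'
```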
**Describe the bug** When I configure datasets for a training task using train-data-path, valid-data-path, and test-data-path, running the training task results in an error. The error message is shown in...
**Your question** When we want to train LLMs on a large collection of corpora, I understand that the usual approach is to provide the documents with the following...
**Your question** Ask a clear and concise question about Megatron-LM. Is `backward` below supposed to be `forward`?
**Your question** Is there a way to start training a llama2 model with a llama3 tokenizer? I plan on doing all the pretraining myself; if so, and someone can provide...
**Describe the bug** The order of the (FP16/BF16) parameters in the buffer differs from the model's forward execution order. As a result, when the `--overlap-param-gather` command...
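A purely conceptual sketch of the mismatch (hypothetical names, heavily simplified; this is not the actual overlap logic): overlapped param gather implicitly assumes the bucket order in the buffer matches the order in which forward consumes parameters, so any divergence means forward waits on, or reads from, the wrong bucket.

```python
# Order of parameters as laid out in the (FP16/BF16) param buffer.
buffer_order = ['mlp.weight', 'attn.weight', 'embed.weight']
# Order in which the model's forward pass actually uses the parameters.
forward_order = ['embed.weight', 'attn.weight', 'mlp.weight']

for step, needed in enumerate(forward_order):
    prefetched = buffer_order[step]  # bucket whose all-gather is ready at this step
    if prefetched != needed:
        print(f'step {step}: forward needs {needed!r} but the gathered bucket holds {prefetched!r}')
```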
**Describe the bug** The file format output by `python examples/multimodal/clip_converter.py` does not match the file format required by `examples/multimodal/combine_mistral_clip.sh`: the converter writes `xxx/state_dict_tp_x.pt`, not `xxx/iter_0000001/mp_rank_00/model_optim_rng.pt`. **To Reproduce** - **Expected behavior** File format...
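If the mismatch is only in directory layout, as the two paths quoted in the report suggest, a small rearrangement like the hypothetical sketch below might bridge the two tools; if the checkpoint contents themselves also differ, a real conversion is still needed. All paths and the rank pattern here are assumptions taken from the report, not from the repo.

```python
import shutil
from pathlib import Path

src_dir = Path('clip_converter_output')   # holds state_dict_tp_0.pt, state_dict_tp_1.pt, ...
dst_dir = Path('clip_megatron_layout')    # layout expected by combine_mistral_clip.sh

for shard in sorted(src_dir.glob('state_dict_tp_*.pt')):
    rank = int(shard.stem.rsplit('_', 1)[-1])
    target = dst_dir / 'iter_0000001' / f'mp_rank_{rank:02d}' / 'model_optim_rng.pt'
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(shard, target)
```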