Devin Chotzen-Hartzell
**Describe the bug** The distributed optimizer state is being saved in an inefficient way when zarr is used as a backend. This causes slowdowns like the following writes to a...
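The kind of inefficiency described here is easiest to see in a standalone zarr example (a minimal sketch, not Megatron's actual checkpoint code; file names, sizes, and chunking are made up): many small slice assignments force a read-modify-write of the containing chunk each time, whereas a single bulk write touches each chunk once.

```python
# Hypothetical illustration of fragmented vs. bulk zarr writes
# (not the Megatron-Core distributed optimizer checkpoint code).
import time

import numpy as np
import zarr

data = np.random.rand(1_000_000).astype(np.float32)

# Many tiny writes: each slice assignment re-reads and re-writes whole chunks.
slow = zarr.open("slow.zarr", mode="w", shape=data.shape, chunks=(10_000,), dtype="f4")
t0 = time.time()
for i in range(0, data.size, 100):
    slow[i : i + 100] = data[i : i + 100]
print(f"fragmented writes: {time.time() - t0:.2f}s")

# One bulk write: each chunk is written exactly once.
fast = zarr.open("fast.zarr", mode="w", shape=data.shape, chunks=(10_000,), dtype="f4")
t0 = time.time()
fast[:] = data
print(f"bulk write:        {time.time() - t0:.2f}s")
```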
**Describe the bug** We're in the process of upgrading Megatron-Core from 0.6 to 0.8 and have noticed some problematic behavior with the new distributed async checkpoint saving introduced in mcore...
**Describe the bug** In `megatron/core/models/gpt/gpt_layer_specs.py`, there are state dict key mappings defined to handle the different state dicts induced by fused operations (e.g., layer norm + MLP fc1 fusion). These...
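For context, those mappings translate parameter keys between the fused and unfused module layouts. A hypothetical sketch of how such a remap gets applied (the prefixes below are illustrative, not necessarily the exact entries in `gpt_layer_specs.py`):

```python
# Hypothetical sketch of remapping state dict keys between fused and unfused
# layouts; the prefixes are illustrative, not the exact Megatron mappings.
from typing import Dict

import torch

# e.g. a standalone pre-MLP LayerNorm whose parameters live inside the fused
# "LayerNorm + fc1" module in the fused layout.
KEY_MAP: Dict[str, str] = {
    "pre_mlp_layernorm.": "mlp.linear_fc1.layer_norm_",
}

def remap_keys(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Rename keys whose prefix appears in KEY_MAP, leaving others untouched."""
    remapped = {}
    for key, tensor in state_dict.items():
        for old_prefix, new_prefix in KEY_MAP.items():
            if key.startswith(old_prefix):
                key = new_prefix + key[len(old_prefix):]
                break
        remapped[key] = tensor
    return remapped

# "pre_mlp_layernorm.weight" -> "mlp.linear_fc1.layer_norm_weight"
print(remap_keys({"pre_mlp_layernorm.weight": torch.ones(8)}))
```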
**Describe the bug** I have a setup with a small MoE model on 2 H100s with 2-way EP (DP) and 1-way TP/PP. I am feeding the same token sequence into the...
[BUG] Dual meaning of `max_position_embeddings`, computing both embedding shape & YaRN scaling base
**Describe the bug** When using MLA on a sequence length other than `config.max_position_embeddings`, a tensor shape mismatch error is thrown while applying the positional embeddings, stemming from [this line](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/multi_latent_attention.py#L353). Effectively,...
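A minimal sketch of the mismatch with generic tensors (not the actual MLA code path), assuming the rotary cos/sin tables are sized by `max_position_embeddings` while the query uses the actual sequence length:

```python
# Minimal reproduction sketch of the shape mismatch (generic tensors, not the
# real multi_latent_attention.py code): a rotary table built for
# max_position_embeddings positions cannot be broadcast against a query of a
# different sequence length.
import torch

max_position_embeddings = 4096   # used both as the embedding table length...
seq_len = 8192                   # ...and (implicitly) as the YaRN scaling base

rotary_dim = 64
cos = torch.randn(max_position_embeddings, rotary_dim)  # table sized by config
q = torch.randn(seq_len, rotary_dim)                     # actual sequence

# Applying the table fails because seq_len != max_position_embeddings.
try:
    _ = q * cos
except RuntimeError as e:
    print(e)  # "The size of tensor a (8192) must match ... (4096) ..."
```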