Devin Chotzen-Hartzell

Results: 5 issues by Devin Chotzen-Hartzell

**Describe the bug** The distributed optimizer state is saved inefficiently when zarr is used as the checkpointing backend. This causes slowdowns, such as the following writes to a...
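
As a rough illustration of the failure mode (not the actual Megatron save path), the sketch below shows how a zarr store's chunk size turns one logical save into many small writes; the array shape, chunk sizes, and store paths are made up for illustration:

```python
import numpy as np
import zarr

# Hypothetical tensor standing in for a shard of optimizer state.
data = np.zeros((4096, 4096), dtype="float32")

# Tiny chunks: every chunk becomes its own file/object in the store,
# so one logical save fans out into thousands of small writes.
slow = zarr.open("ckpt_slow.zarr", mode="w", shape=data.shape,
                 chunks=(64, 64), dtype=data.dtype)
slow[:] = data  # (4096/64)**2 = 4096 separate chunk writes

# Coarse chunks: the same data lands in a single large write.
fast = zarr.open("ckpt_fast.zarr", mode="w", shape=data.shape,
                 chunks=(4096, 4096), dtype=data.dtype)
fast[:] = data  # one chunk write
```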

**Describe the bug** We're in the process of upgrading Megatron-Core from 0.6 to 0.8 and have noticed some problematic behavior with the new distributed async checkpoint saving introduced in mcore...
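
For context, here is a minimal sketch of the generic async-save pattern such a feature follows, assuming a plain PyTorch model; this is not mcore's actual API, and the function name is hypothetical:

```python
import threading
import torch

def async_save(model: torch.nn.Module, path: str) -> threading.Thread:
    """Generic async-save pattern (hypothetical helper, not mcore's API):
    snapshot parameters to CPU synchronously, then persist off-thread so
    training can resume while the write is in flight."""
    # The CPU copy must complete before training mutates the weights.
    cpu_state = {k: v.detach().cpu().clone()
                 for k, v in model.state_dict().items()}

    def _write():
        torch.save(cpu_state, path)

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # caller should join() before the next save or at shutdown
```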

**Describe the bug** In `megatron/core/models/gpt/gpt_layer_specs.py`, there are state dict key mappings defined to handle the different state dicts induced by fused operations (e.g., layer norm + MLP fc1 fusion). These...
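
A minimal sketch of the kind of prefix-based key remapping the excerpt refers to; the helper and the example mapping below are hypothetical illustrations in the spirit of the layer norm + fc1 fusion, not the actual code in `gpt_layer_specs.py`:

```python
def remap_state_dict_keys(state_dict: dict, key_map: dict) -> dict:
    """Rewrite checkpoint keys by prefix so a checkpoint saved from a
    fused module layout loads into an unfused one (or vice versa)."""
    remapped = {}
    for key, tensor in state_dict.items():
        for src_prefix, dst_prefix in key_map.items():
            if key.startswith(src_prefix):
                key = dst_prefix + key[len(src_prefix):]
                break
        remapped[key] = tensor
    return remapped

# Hypothetical mapping: a standalone pre-MLP layer norm whose weights
# live inside the fused fc1 module in the fused layout.
key_map = {"pre_mlp_layernorm.": "mlp.linear_fc1.layer_norm_"}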


**Describe the bug** I have a small MoE model set up on 2 H100s with 2-way EP (DP) and 1-way TP/PP. I am feeding the same token sequence into the...
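
One way to probe a setup like this is to gather the outputs from every rank and compare them; the helper below is a hypothetical diagnostic (not from the issue), assuming `torch.distributed` is already initialized:

```python
import torch
import torch.distributed as dist

def assert_ranks_agree(logits: torch.Tensor, atol: float = 0.0) -> None:
    """Gather one tensor from every rank and verify they match, e.g. to
    check that identical inputs yield identical outputs across DP ranks."""
    world = dist.get_world_size()
    gathered = [torch.empty_like(logits) for _ in range(world)]
    dist.all_gather(gathered, logits.contiguous())
    for rank, other in enumerate(gathered):
        if not torch.allclose(gathered[0], other, atol=atol):
            diff = (gathered[0] - other).abs().max().item()
            raise AssertionError(f"rank 0 vs rank {rank}: max diff {diff}")
```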


**Describe the bug** When using MLA on a sequence length other than `config.max_position_embeddings`, a tensor shape mismatch error is thrown while applying the positional embeddings, stemming from [this line](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/multi_latent_attention.py#L353). Effectively,...
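
A standalone sketch of the mismatch pattern described: a rotary table precomputed for `max_position_embeddings` cannot be applied as-is to a shorter sequence and must be sliced to the actual length first. The sizes and the `rotate_half` helper are illustrative, not Megatron's code:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

max_pos, seq_len, dim = 4096, 1024, 64  # hypothetical sizes

# Rotary table precomputed for the configured maximum length.
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
freqs = torch.outer(torch.arange(max_pos, dtype=torch.float), inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)  # [max_pos, dim]

q = torch.randn(seq_len, dim)

# Broadcasting the full [max_pos, dim] table against a [seq_len, dim]
# tensor raises a shape-mismatch error; slicing to seq_len is the fix.
cos, sin = emb[:seq_len].cos(), emb[:seq_len].sin()
q_rot = q * cos + rotate_half(q) * sin
```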