Devin Chotzen-Hartzell
**Describe the bug** The distributed optimizer state is being saved in an inefficient way when zarr is used as a backend. This causes slowdowns like the following writes to a...
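The kind of inefficiency described here is easiest to see in a standalone zarr example (a minimal sketch, not Megatron's actual checkpoint code; file names, sizes, and chunking are made up): many small slice assignments force a read-modify-write of the containing chunk each time, whereas a single bulk write touches each chunk once.

```python
# Hypothetical illustration of fragmented vs. bulk zarr writes
# (not the Megatron-Core distributed optimizer checkpoint code).
import time

import numpy as np
import zarr

data = np.random.rand(1_000_000).astype(np.float32)

# Many tiny writes: each slice assignment re-reads and re-writes whole chunks.
slow = zarr.open("slow.zarr", mode="w", shape=data.shape, chunks=(10_000,), dtype="f4")
t0 = time.time()
for i in range(0, data.size, 100):
    slow[i : i + 100] = data[i : i + 100]
print(f"fragmented writes: {time.time() - t0:.2f}s")

# One bulk write: each chunk is written exactly once.
fast = zarr.open("fast.zarr", mode="w", shape=data.shape, chunks=(10_000,), dtype="f4")
t0 = time.time()
fast[:] = data
print(f"bulk write:        {time.time() - t0:.2f}s")
```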
**Describe the bug** We're in the process of upgrading Megatron-Core from 0.6 to 0.8 and have noticed some problematic behavior with the new distributed async checkpoint saving introduced in mcore...
**Describe the bug** In `megatron/core/models/gpt/gpt_layer_specs.py`, there are state dict key mappings defined to handle the different state dicts induced by fused operations (e.g., layer norm + MLP fc1 fusion). These...
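For context, those mappings translate parameter keys between the fused and unfused module layouts. A hypothetical sketch of how such a remap gets applied (the prefixes below are illustrative, not necessarily the exact entries in `gpt_layer_specs.py`):

```python
# Hypothetical sketch of remapping state dict keys between fused and unfused
# layouts; the prefixes are illustrative, not the exact Megatron mappings.
from typing import Dict

import torch

# e.g. a standalone pre-MLP LayerNorm whose parameters live inside the fused
# "LayerNorm + fc1" module in the fused layout.
KEY_MAP: Dict[str, str] = {
    "pre_mlp_layernorm.": "mlp.linear_fc1.layer_norm_",
}

def remap_keys(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Rename keys whose prefix appears in KEY_MAP, leaving others untouched."""
    remapped = {}
    for key, tensor in state_dict.items():
        for old_prefix, new_prefix in KEY_MAP.items():
            if key.startswith(old_prefix):
                key = new_prefix + key[len(old_prefix):]
                break
        remapped[key] = tensor
    return remapped

# "pre_mlp_layernorm.weight" -> "mlp.linear_fc1.layer_norm_weight"
print(remap_keys({"pre_mlp_layernorm.weight": torch.ones(8)}))
```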
**Describe the bug** I have a setup with a small MoE model on 2 H100s with 2-way EP (DP) and 1-way TP/PP. I am feeding the same token sequence into the...
[BUG] Dual meaning of `max_position_embeddings`, computing both embedding shape & YaRN scaling base
**Describe the bug** When using MLA on a sequence length other than `config.max_position_embeddings`, a tensor shape mismatch error is thrown while applying the positional embeddings, stemming from [this line](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/multi_latent_attention.py#L353). Effectively,...
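A minimal sketch of the mismatch with generic tensors (not the actual MLA code path), assuming the rotary cos/sin tables are sized by `max_position_embeddings` while the query uses the actual sequence length:

```python
# Minimal reproduction sketch of the shape mismatch (generic tensors, not the
# real multi_latent_attention.py code): a rotary table built for
# max_position_embeddings positions cannot be broadcast against a query of a
# different sequence length.
import torch

max_position_embeddings = 4096   # used both as the embedding table length...
seq_len = 8192                   # ...and (implicitly) as the YaRN scaling base

rotary_dim = 64
cos = torch.randn(max_position_embeddings, rotary_dim)  # table sized by config
q = torch.randn(seq_len, rotary_dim)                     # actual sequence

# Applying the table fails because seq_len != max_position_embeddings.
try:
    _ = q * cos
except RuntimeError as e:
    print(e)  # "The size of tensor a (8192) must match ... (4096) ..."
```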