Megatron-LM
Ongoing research training transformer models at scale
**Your question** 1. I have a question about how the `pp` groups are created when enabling `context_parallel_size > 1` and `encoder_tensor_parallel_size != tensor_parallel_size`. When context parallelism is enabled, the input is split symmetrically...
**Describe the bug** When using a Zarr distributed checkpoint together with a distributed optimizer, each rank writes its optimizer states according to the ShardedTensor's `flattened_range`. The Zarr strategy uses synchronizers to ensure the...
https://github.com/NVIDIA/Megatron-LM/blob/54f1f78529cbc2b9cddad313e7f9d96ac0420a27/megatron/legacy/model/multiple_choice.py#L42