Megatron-LM
Ongoing research training transformer models at scale
**Your question** Can we have sample idx + bin files as required by `pretrain_gpt.py`? Running tools/preprocess_data.py on some sample...
Add Hopper llama2 7b mcore gold example
This minor patch fixes a typo in the MoE introduction document.
**Describe the bug** Hi, I think there is a bug when context parallelism is on, and I would like to discuss it. https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/pretrain_gpt.py#L148 From this [issue](https://github.com/NVIDIA/Megatron-LM/issues/673), I know the result is the same for...
This function is wrong; the program will hang because of the "group" variable:

```python
def _batched_p2p_ops(
    *,
    tensor_send_prev: Optional[torch.Tensor],
    tensor_recv_prev: Optional[torch.Tensor],
    tensor_send_next: Optional[torch.Tensor],
    tensor_recv_next: Optional[torch.Tensor],
    group: torch.distributed.ProcessGroup
)
```

After modification: def...
# Batch_input and elapsed time per iteration slow down during model training

## Arguments

```
data_impl ....................... mmap........................updated
deepspeed_extra_args ............ {'bf16': {'enabled': True}}.updated
dynamic_loss_scale .............. True........................updated
eval_interval ................... 40000.......................updated
eval_iters...
```
I asked this question in the discussion section but did not receive any response, so I am asking here with a bit more detail. I am trying to figure out if...
- The most important thing is adding `// args.context_parallel_size`; otherwise, when we scale very long sequences across many GPUs using context parallelism, it causes an OOM here.
- Also add...
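The point of the division above is that under context parallelism each rank holds only its shard of the sequence, so any buffer sized by the full sequence length must be divided by the context-parallel size. A minimal sketch, assuming a hypothetical helper `local_seq_length` (not Megatron-LM code):

```python
def local_seq_length(seq_length: int, context_parallel_size: int) -> int:
    # Each context-parallel rank owns an equal contiguous shard of the
    # sequence, so the global length must divide evenly across ranks.
    assert seq_length % context_parallel_size == 0, (
        "seq_length must be divisible by context_parallel_size"
    )
    return seq_length // context_parallel_size

# Without the division, each of the 8 GPUs would allocate buffers for all
# 131072 tokens instead of its own 16384-token shard.
print(local_seq_length(131072, 8))  # 16384
```

This is why omitting `// args.context_parallel_size` negates the memory benefit of context parallelism: every rank sizes its activations for the whole sequence.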
You may want to enter multiple context managers in one with-statement. However, `with rng_context and fp8_context` is equivalent to `with fp8_context`: the `and` keyword simply evaluates to one of its operands (here `fp8_context`, since `rng_context` is truthy) and cannot be overloaded, so `rng_context` is never entered. The correct way...
Hello, after pre-processing the dataset with a BPE tokenizer, when I launch the 'train.sh' script for Mamba I get this error. The script mentions that I have...