Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

124 Megatron-DeepSpeed issues

Hello, which version of the Megatron-LM library is your code modified from?

I got an `AssertionError: Mask is silently ignored due to the use of a custom kernel` when training GPT-2 with `examples/pretrain_gpt.sh`. This line leads to the assertion error: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/8387ae17c4704f6579f88a84500b535d19d7fbbf/megatron/model/fused_softmax.py#L191 Is...
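A minimal sketch (not the repository's code) of why this assertion can fire, assuming the fused causal-softmax kernel applies the upper-triangular mask internally, so an explicit mask on that path would be silently ignored; the `--no-masked-softmax-fusion` flag mentioned in the comment is an assumption carried over from upstream Megatron-LM:

```python
# Hedged sketch: why passing an explicit attention mask together with the
# fused causal-softmax kernel trips an assertion. The custom kernel bakes in
# the upper-triangular (causal) mask, so an explicit mask would be ignored.
import torch

def fused_scale_mask_softmax(scores: torch.Tensor, mask, use_fused_causal_kernel: bool):
    if use_fused_causal_kernel:
        # The fused kernel applies the causal mask itself; an explicit mask is redundant.
        assert mask is None, "Mask is silently ignored due to the use of a custom kernel"
        causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))  # emulate the kernel's implicit mask
    elif mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

# Passing mask=None on the fused path avoids the assertion; alternatively the
# fusion can usually be disabled from the launch script (e.g. a flag such as
# --no-masked-softmax-fusion, if this fork exposes it).
```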

Do you know why I got this problem with `pretrain_gpt_single_node.sh`? I'm setting `N_GPUS=1` and got: `File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 191, in _get_group_rank: raise RuntimeError("The given group does not exist")` RuntimeError:...
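A minimal sketch, assuming the error comes from a collective being issued against a model-parallel group that was never built on a single-GPU launch; the initialization below is plain `torch.distributed`, not the repository's launcher:

```python
# Hedged sketch: "The given group does not exist" usually means a process group
# was never created on this rank. A minimal single-GPU (world size 1) setup:
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)  # use "gloo" on CPU-only machines

# With one GPU, tensor- and pipeline-model-parallel sizes must both be 1 so that
# the model-parallel groups Megatron-DeepSpeed builds actually contain this rank
# (--tensor-model-parallel-size 1 --pipeline-model-parallel-size 1).
print(dist.get_rank(), dist.get_world_size())
dist.destroy_process_group()
```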

This PR aims to add an option to use [FlashAttention](https://github.com/HazyResearch/flash-attention). Inspired by https://github.com/NVIDIA/Megatron-LM/pull/267 cc @thomasw21
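A hedged sketch of what delegating the attention core to FlashAttention can look like; `flash_attn_func` and its tensor layout follow the flash-attn package's public interface, which is not necessarily the exact entry point this PR uses:

```python
# Hedged sketch: swapping the attention core for FlashAttention.
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

def attention(q, k, v, causal=True, dropout_p=0.0):
    # q, k, v: (batch, seqlen, num_heads, head_dim), fp16/bf16 tensors on CUDA.
    return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)

if __name__ == "__main__":
    q = torch.randn(2, 2048, 16, 64, device="cuda", dtype=torch.float16)
    out = attention(q, q, q)  # self-attention over the same projections
    print(out.shape)          # (2, 2048, 16, 64)
```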

Hello, can Megatron-DeepSpeed pre-train LLaMA 2? Could you give a sample script?

`d norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.886 | TFLOPs: 78.46 | iteration 5426/...`

Hi, I have looked up a lot of information, but I still don't understand the difference between ZeRO-3 and Megatron with ZeRO-2. They both split the model.
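A hedged configuration sketch of the distinction, assuming standard DeepSpeed config keys: ZeRO-3 shards the parameters themselves across data-parallel ranks and gathers them per layer, while Megatron tensor parallelism splits each weight matrix inside the layer math, and ZeRO-2 on top of it only shards gradients and optimizer states:

```python
# Hedged sketch contrasting the two setups at the configuration level.
zero3_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},  # parameters, gradients and optimizer states all sharded
}

tp_plus_zero2_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 2},  # gradients + optimizer states sharded; parameters replicated
}
# Tensor parallelism is configured on the Megatron side, not in the DeepSpeed
# JSON, e.g. --tensor-model-parallel-size 8 on the pretrain_gpt command line.
```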

I start the job and I meet this error. CUDA: 12.0, torch: 1.14, command: `deepspeed --num_gpus 2 pretrain_gpt_v2.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --distributed-backend nccl --num-layers 2 --hidden-size 64 --num-attention-heads 2 --seq-length...`

Hello. When using tensor parallelism on BLOOM (tp_size = 8), we find that the cross-entropy loss computed by `mpu.cross_entropy` is different from `torch.nn.functional.cross_entropy`. The difference is about 1% for our...
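A single-process sketch of the vocab-parallel cross-entropy idea behind `mpu.cross_entropy` (an emulation, not the repository's implementation): each TP rank holds a vocabulary shard, computes a local max and a local sum of exponentials, and all-reduces combine them into the global softmax denominator.

```python
# Hedged emulation of vocab-parallel cross entropy on a single process:
# the vocabulary dimension is chunked as if split across tp_size ranks,
# and the per-shard max/sum-exp reductions stand in for the all-reduces.
import torch
import torch.nn.functional as F

def vocab_parallel_cross_entropy(logits: torch.Tensor, target: torch.Tensor, tp_size: int = 8):
    shards = logits.chunk(tp_size, dim=-1)                                               # one vocab shard per "rank"
    global_max = torch.stack([s.max(dim=-1).values for s in shards]).max(dim=0).values   # all-reduce MAX
    sum_exp = sum((s - global_max.unsqueeze(-1)).exp().sum(dim=-1) for s in shards)      # all-reduce SUM
    target_logit = logits.gather(-1, target.unsqueeze(-1)).squeeze(-1)                   # held by the owning rank
    return (global_max + sum_exp.log() - target_logit).mean()

logits = torch.randn(16, 32000, dtype=torch.float32)
target = torch.randint(0, 32000, (16,))
print(vocab_parallel_cross_entropy(logits, target), F.cross_entropy(logits, target))
```

In fp32 the two values agree to floating-point precision, so a gap of around 1% usually points at fp16 accumulation or an input mismatch rather than a different formula.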