Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

124 Megatron-DeepSpeed issues

Hello, which version of the Megatron-LM library is your code modified from?

I got an `AssertionError: Mask is silently ignored due to the use of a custom kernel` when training GPT-2 with `examples/pretrain_gpt.sh`. This line leads to the assertion error: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/8387ae17c4704f6579f88a84500b535d19d7fbbf/megatron/model/fused_softmax.py#L191 Is...
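A minimal sketch (not the repository's code) of why this assertion can fire, assuming the fused causal-softmax kernel applies the upper-triangular mask internally, so an explicit mask on that path would be silently ignored; the `--no-masked-softmax-fusion` flag mentioned in the comment is an assumption carried over from upstream Megatron-LM:

```python
# Hedged sketch: why passing an explicit attention mask together with the
# fused causal-softmax kernel trips an assertion. The custom kernel bakes in
# the upper-triangular (causal) mask, so an explicit mask would be ignored.
import torch

def fused_scale_mask_softmax(scores: torch.Tensor, mask, use_fused_causal_kernel: bool):
    if use_fused_causal_kernel:
        # The fused kernel applies the causal mask itself; an explicit mask is redundant.
        assert mask is None, "Mask is silently ignored due to the use of a custom kernel"
        causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))  # emulate the kernel's implicit mask
    elif mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

# Passing mask=None on the fused path avoids the assertion; alternatively the
# fusion can usually be disabled from the launch script (e.g. a flag such as
# --no-masked-softmax-fusion, if this fork exposes it).
```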

Do you know why I got this problem with `pretrain_gpt_single_node.sh`? I'm setting `N_GPUS=1` and got: `File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 191, in _get_group_rank: raise RuntimeError("The given group does not exist")` RuntimeError:...
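A minimal sketch, assuming the error comes from a collective being issued against a model-parallel group that was never built on a single-GPU launch; the initialization below is plain `torch.distributed`, not the repository's launcher:

```python
# Hedged sketch: "The given group does not exist" usually means a process group
# was never created on this rank. A minimal single-GPU (world size 1) setup:
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)  # use "gloo" on CPU-only machines

# With one GPU, tensor- and pipeline-model-parallel sizes must both be 1 so that
# the model-parallel groups Megatron-DeepSpeed builds actually contain this rank
# (--tensor-model-parallel-size 1 --pipeline-model-parallel-size 1).
print(dist.get_rank(), dist.get_world_size())
dist.destroy_process_group()
```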

This PR aims to add an option to use [FlashAttention](https://github.com/HazyResearch/flash-attention). Inspired by https://github.com/NVIDIA/Megatron-LM/pull/267 cc @thomasw21
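A hedged sketch of what delegating the attention core to FlashAttention can look like; `flash_attn_func` and its tensor layout follow the flash-attn package's public interface, which is not necessarily the exact entry point this PR uses:

```python
# Hedged sketch: swapping the attention core for FlashAttention.
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

def attention(q, k, v, causal=True, dropout_p=0.0):
    # q, k, v: (batch, seqlen, num_heads, head_dim), fp16/bf16 tensors on CUDA.
    return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)

if __name__ == "__main__":
    q = torch.randn(2, 2048, 16, 64, device="cuda", dtype=torch.float16)
    out = attention(q, q, q)  # self-attention over the same projections
    print(out.shape)          # (2, 2048, 16, 64)
```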

Hello, can Megatron-DeepSpeed pre-train LLaMA 2? Could you give a sample script?

`d norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.886 | TFLOPs: 78.46 | iteration 5426/...`

Hi, I have looked up a lot of information, but I still don't understand the difference between ZeRO-3 and Megatron with ZeRO-2. They both split the model.
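A hedged configuration sketch of the distinction, assuming standard DeepSpeed config keys: ZeRO-3 shards the parameters themselves across data-parallel ranks and gathers them per layer, while Megatron tensor parallelism splits each weight matrix inside the layer math, and ZeRO-2 on top of it only shards gradients and optimizer states:

```python
# Hedged sketch contrasting the two setups at the configuration level.
zero3_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},  # parameters, gradients and optimizer states all sharded
}

tp_plus_zero2_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 2},  # gradients + optimizer states sharded; parameters replicated
}
# Tensor parallelism is configured on the Megatron side, not in the DeepSpeed
# JSON, e.g. --tensor-model-parallel-size 8 on the pretrain_gpt command line.
```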

I start the job and I meet this error. CUDA: 12.0, torch: 1.14, command: `deepspeed --num_gpus 2 pretrain_gpt_v2.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --distributed-backend nccl --num-layers 2 --hidden-size 64 --num-attention-heads 2 --seq-length...`

Hello. When using tensor parallelism on BLOOM (tp_size = 8), we find that the cross-entropy loss computed by `mpu.cross_entropy` is different from `torch.nn.functional.cross_entropy`. The difference is about 1% for our...
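A single-process sketch of the vocab-parallel cross-entropy idea behind `mpu.cross_entropy` (an emulation, not the repository's implementation): each TP rank holds a vocabulary shard, computes a local max and a local sum of exponentials, and all-reduces combine them into the global softmax denominator.

```python
# Hedged emulation of vocab-parallel cross entropy on a single process:
# the vocabulary dimension is chunked as if split across tp_size ranks,
# and the per-shard max/sum-exp reductions stand in for the all-reduces.
import torch
import torch.nn.functional as F

def vocab_parallel_cross_entropy(logits: torch.Tensor, target: torch.Tensor, tp_size: int = 8):
    shards = logits.chunk(tp_size, dim=-1)                                               # one vocab shard per "rank"
    global_max = torch.stack([s.max(dim=-1).values for s in shards]).max(dim=0).values   # all-reduce MAX
    sum_exp = sum((s - global_max.unsqueeze(-1)).exp().sum(dim=-1) for s in shards)      # all-reduce SUM
    target_logit = logits.gather(-1, target.unsqueeze(-1)).squeeze(-1)                   # held by the owning rank
    return (global_max + sum_exp.log() - target_logit).mean()

logits = torch.randn(16, 32000, dtype=torch.float32)
target = torch.randint(0, 32000, (16,))
print(vocab_parallel_cross_entropy(logits, target), F.cross_entropy(logits, target))
```

In fp32 the two values agree to floating-point precision, so a gap of around 1% usually points at fp16 accumulation or an input mismatch rather than a different formula.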