Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Results: 124 Megatron-DeepSpeed issues

* Only check for `position_embedding_type` if the field exists in the checkpoint-loaded args.
* Only load optimizer/LR scheduler states if the user provides an optimizer and LR scheduler.
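
A minimal sketch of both guards, using stand-in names (`checkpoint_args`, `state_dict`, `optimizer`, `lr_scheduler`) for the objects involved in checkpoint loading:

```python
def check_position_embedding_type(args, checkpoint_args):
    # Older checkpoints may predate `position_embedding_type`, so only compare
    # the field when the checkpoint-loaded args actually carry it.
    if hasattr(checkpoint_args, "position_embedding_type"):
        assert args.position_embedding_type == checkpoint_args.position_embedding_type, (
            "position_embedding_type differs between command line and checkpoint"
        )


def load_optim_states(state_dict, optimizer=None, lr_scheduler=None):
    # Restore optimizer / LR-scheduler state only when the caller provided them.
    if optimizer is not None and "optimizer" in state_dict:
        optimizer.load_state_dict(state_dict["optimizer"])
    if lr_scheduler is not None and "lr_scheduler" in state_dict:
        lr_scheduler.load_state_dict(state_dict["lr_scheduler"])
```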

A new test to reproduce the issue with BNB when switching from 1 replica to 2 (i.e. the DP degree changes while the PP and TP degrees stay the same): the original...

We need a diagnostic total model size dumped during framework init. We currently get a report per rank and not the total. ``` > number of parameters on...

Good First Issue
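
One way to get the missing total is to all-reduce each rank's local parameter count and divide out the data-parallel replication; a minimal sketch, assuming `torch.distributed` is already initialized and the data-parallel size is passed in (e.g. from `mpu.get_data_parallel_world_size()`):

```python
import torch
import torch.distributed as dist

def report_total_parameters(model, data_parallel_size):
    # Each rank only holds its own TP/PP shard, so sum the local counts across
    # every rank, then divide out the data-parallel replication factor.
    local_count = torch.tensor(
        [sum(p.numel() for p in model.parameters())],
        dtype=torch.long,
        device=torch.cuda.current_device(),
    )
    dist.all_reduce(local_count, op=dist.ReduceOp.SUM)
    total = local_count.item() // data_parallel_size
    if dist.get_rank() == 0:
        print(f" > total number of parameters: {total:,} ({total / 1e9:.2f}B)")
```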

In `training.py`, we have https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L818. However, this appears to be wasted compute, since the model parameter count does not change. We can refactor the code so that `get_parameters_in_billions` is called...
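
A sketch of that refactor, with stand-in names for the surrounding training loop; `get_parameters_in_billions` is the helper referenced above, computed once before the loop and reused for every log line:

```python
def train(forward_backward_step, model, args, log_fn=print):
    # The model's parameter count never changes during training, so compute it
    # once up front instead of on every logging iteration inside the loop.
    params_in_billions = get_parameters_in_billions(model)

    for iteration in range(1, args.train_iters + 1):
        loss = forward_backward_step(model)
        if iteration % args.log_interval == 0:
            log_fn(f"iteration {iteration}: loss {loss:.4f}, "
                   f"model size {params_in_billions:.2f}B params")
```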

This PR addresses #114 by checking that the Megatron-DS to Hugging Face Transformers conversion works as intended.

test

We will need to hack https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/module.py#L378-L384 to support a `partition_method` like `type:embed:2|transformer:1` - or something like that - so that the embed weights get a 2x partitioning weight and get their own...

Good First Issue
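
One way the proposed spec could work is to parse it into per-layer weights and hand those to the existing weight-balanced partitioner; a sketch under that assumption (the spec format and helper names here are illustrative, not DeepSpeed API):

```python
import re

def parse_weighted_type_spec(spec):
    # "embed:2|transformer:1" -> [(regex("embed"), 2.0), (regex("transformer"), 1.0)]
    rules = []
    for rule in spec.split("|"):
        pattern, _, weight = rule.partition(":")
        rules.append((re.compile(pattern, re.IGNORECASE), float(weight or 1.0)))
    return rules

def layer_weights(layer_class_names, rules):
    # Each layer gets the weight of the first matching rule (0 if none match),
    # so an embed layer counts as 2x a transformer layer when balancing stages.
    weights = []
    for name in layer_class_names:
        weight = 0.0
        for pattern, rule_weight in rules:
            if pattern.search(name):
                weight = rule_weight
                break
        weights.append(weight)
    return weights

# A `partition_method` of "type:embed:2|transformer:1" would strip the "type:"
# prefix and pass the rest here:
rules = parse_weighted_type_spec("embed:2|transformer:1")
print(layer_weights(["EmbeddingPipe", "ParallelTransformerLayerPipe"], rules))
# -> [2.0, 1.0]
```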

## Motivation
#177 allows train iterations to be skipped. This functionality is achieved by using a separate internal counter to keep track of skipped iterations instead of tinkering with...
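
A sketch of the separate-counter approach, with illustrative names:

```python
class SkipTracker:
    """Track skipped iterations without touching the main iteration counter."""

    def __init__(self):
        self.skipped_iters = 0

    def should_skip(self, iteration, skip_ranges):
        # skip_ranges: list of inclusive (start, end) iteration ranges to skip.
        return any(start <= iteration <= end for start, end in skip_ranges)

    def record_skip(self):
        self.skipped_iters += 1


tracker = SkipTracker()
skip_ranges = [(100, 104)]           # e.g. skip 5 iterations of bad data
for iteration in range(200):
    if tracker.should_skip(iteration, skip_ranges):
        tracker.record_skip()        # consume the batch, no optimizer step
        continue
    # ... the regular train_step(...) would run here ...
print(tracker.skipped_iters)         # -> 5
```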

It takes forever to build the Meg CUDA kernels, as the build runs sequentially and doesn't take advantage of multiple cores; it takes some 5 minutes to build. And every...

Good First Issue
Good Difficult Issue
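
PyTorch builds these JIT extensions with ninja, which honors the `MAX_JOBS` environment variable, so one lever is to raise it to the core count before the kernels are compiled; a sketch (the kernel name and source paths are illustrative):

```python
import os
from torch.utils import cpp_extension

# Let ninja compile with as many parallel jobs as there are CPU cores instead
# of the default, which leaves most cores idle during the fused-kernel build.
os.environ.setdefault("MAX_JOBS", str(os.cpu_count()))

scaled_softmax = cpp_extension.load(
    name="scaled_upper_triang_masked_softmax_cuda",
    sources=[
        "megatron/fused_kernels/scaled_upper_triang_masked_softmax.cpp",
        "megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu",
    ],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)
```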

Working on debugging using a live checkpoint (with optim states) but with a small custom dataset.