Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Results: 124 Megatron-DeepSpeed issues

* Only check for `position_embedding_type` if the field exists in the checkpoint-loaded args.
* Only load optimizer/LR scheduler states if the user provides an optimizer and LR scheduler.
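
A minimal sketch of both guards, using stand-in names (`checkpoint_args`, `state_dict`, `optimizer`, `lr_scheduler`) for the objects involved in checkpoint loading:

```python
def check_position_embedding_type(args, checkpoint_args):
    # Older checkpoints may predate `position_embedding_type`, so only compare
    # the field when the checkpoint-loaded args actually carry it.
    if hasattr(checkpoint_args, "position_embedding_type"):
        assert args.position_embedding_type == checkpoint_args.position_embedding_type, (
            "position_embedding_type differs between command line and checkpoint"
        )


def load_optim_states(state_dict, optimizer=None, lr_scheduler=None):
    # Restore optimizer / LR-scheduler state only when the caller provided them.
    if optimizer is not None and "optimizer" in state_dict:
        optimizer.load_state_dict(state_dict["optimizer"])
    if lr_scheduler is not None and "lr_scheduler" in state_dict:
        lr_scheduler.load_state_dict(state_dict["lr_scheduler"])
```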

A new test to reproduce the issue with BNB when switching from 1 replica to 2 (i.e. the DP degree changes while the PP and TP degrees stay the same): the original...

We need a diagnostic total model size dumped during framework init. We currently get a report per rank and not the total. ``` > number of parameters on...

Good First Issue
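
One way to get the missing total is to all-reduce each rank's local parameter count and divide out the data-parallel replication; a minimal sketch, assuming `torch.distributed` is already initialized and the data-parallel size is passed in (e.g. from `mpu.get_data_parallel_world_size()`):

```python
import torch
import torch.distributed as dist

def report_total_parameters(model, data_parallel_size):
    # Each rank only holds its own TP/PP shard, so sum the local counts across
    # every rank, then divide out the data-parallel replication factor.
    local_count = torch.tensor(
        [sum(p.numel() for p in model.parameters())],
        dtype=torch.long,
        device=torch.cuda.current_device(),
    )
    dist.all_reduce(local_count, op=dist.ReduceOp.SUM)
    total = local_count.item() // data_parallel_size
    if dist.get_rank() == 0:
        print(f" > total number of parameters: {total:,} ({total / 1e9:.2f}B)")
```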

In `training.py`, we have https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L818. However, this appears to be wasted compute, since the model parameter count does not change. We can refactor the code so that `get_parameters_in_billions` is called...
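
A sketch of that refactor, with stand-in names for the surrounding training loop; `get_parameters_in_billions` is the helper referenced above, computed once before the loop and reused for every log line:

```python
def train(forward_backward_step, model, args, log_fn=print):
    # The model's parameter count never changes during training, so compute it
    # once up front instead of on every logging iteration inside the loop.
    params_in_billions = get_parameters_in_billions(model)

    for iteration in range(1, args.train_iters + 1):
        loss = forward_backward_step(model)
        if iteration % args.log_interval == 0:
            log_fn(f"iteration {iteration}: loss {loss:.4f}, "
                   f"model size {params_in_billions:.2f}B params")
```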

This PR addresses #114 by checking that the Megatron-DS to Hugging Face Transformers conversion works as intended.

test

We will need to hack https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/module.py#L378-L384 to support a `partition_method` like `type:embed:2|transformer:1` - or something like that - so that the embed weights get a 2x partitioning weight and get their own...

Good First Issue
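
One way the proposed spec could work is to parse it into per-layer weights and hand those to the existing weight-balanced partitioner; a sketch under that assumption (the spec format and helper names here are illustrative, not DeepSpeed API):

```python
import re

def parse_weighted_type_spec(spec):
    # "embed:2|transformer:1" -> [(regex("embed"), 2.0), (regex("transformer"), 1.0)]
    rules = []
    for rule in spec.split("|"):
        pattern, _, weight = rule.partition(":")
        rules.append((re.compile(pattern, re.IGNORECASE), float(weight or 1.0)))
    return rules

def layer_weights(layer_class_names, rules):
    # Each layer gets the weight of the first matching rule (0 if none match),
    # so an embed layer counts as 2x a transformer layer when balancing stages.
    weights = []
    for name in layer_class_names:
        weight = 0.0
        for pattern, rule_weight in rules:
            if pattern.search(name):
                weight = rule_weight
                break
        weights.append(weight)
    return weights

# A `partition_method` of "type:embed:2|transformer:1" would strip the "type:"
# prefix and pass the rest here:
rules = parse_weighted_type_spec("embed:2|transformer:1")
print(layer_weights(["EmbeddingPipe", "ParallelTransformerLayerPipe"], rules))
# -> [2.0, 1.0]
```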

## Motivation
#177 allows train iterations to be skipped. This functionality is achieved by using a separate internal counter to keep track of skipped iterations instead of tinkering with...
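
A sketch of the separate-counter approach, with illustrative names:

```python
class SkipTracker:
    """Track skipped iterations without touching the main iteration counter."""

    def __init__(self):
        self.skipped_iters = 0

    def should_skip(self, iteration, skip_ranges):
        # skip_ranges: list of inclusive (start, end) iteration ranges to skip.
        return any(start <= iteration <= end for start, end in skip_ranges)

    def record_skip(self):
        self.skipped_iters += 1


tracker = SkipTracker()
skip_ranges = [(100, 104)]           # e.g. skip 5 iterations of bad data
for iteration in range(200):
    if tracker.should_skip(iteration, skip_ranges):
        tracker.record_skip()        # consume the batch, no optimizer step
        continue
    # ... the regular train_step(...) would run here ...
print(tracker.skipped_iters)         # -> 5
```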

It takes forever to build the Meg CUDA kernels, as the build runs sequentially and doesn't take advantage of multiple cores; it takes some 5 minutes to build. And every...

Good First Issue
Good Difficult Issue
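
PyTorch builds these JIT extensions with ninja, which honors the `MAX_JOBS` environment variable, so one lever is to raise it to the core count before the kernels are compiled; a sketch (the kernel name and source paths are illustrative):

```python
import os
from torch.utils import cpp_extension

# Let ninja compile with as many parallel jobs as there are CPU cores instead
# of the default, which leaves most cores idle during the fused-kernel build.
os.environ.setdefault("MAX_JOBS", str(os.cpu_count()))

scaled_softmax = cpp_extension.load(
    name="scaled_upper_triang_masked_softmax_cuda",
    sources=[
        "megatron/fused_kernels/scaled_upper_triang_masked_softmax.cpp",
        "megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu",
    ],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)
```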

Working on debugging using a live checkpoint (with optim states) but with a small custom dataset.