Stas Bekman

128 issues opened by Stas Bekman

The paper NormFormer: Improved Transformer Pretraining with Extra Normalization (https://arxiv.org/abs/2110.09456) suggests that under pre-LN: > gradients at earlier layers tend to be larger than gradients at later layers so we...
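If we want to check that gradient-imbalance claim on our own pre-LN runs, a minimal sketch (plain PyTorch; the helper name is hypothetical and `depth` is model-dependent) could look like this, called after `loss.backward()` and before `optimizer.step()`:

```python
import torch.nn as nn

def per_layer_grad_norms(model: nn.Module, depth: int = 3) -> dict:
    """Group parameter gradient L2 norms by name prefix, e.g. 'transformer.h.0'.

    Pick `depth` so the prefix ends at the layer index for the model in question.
    """
    sq_sums = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        key = ".".join(name.split(".")[:depth])
        sq_sums[key] = sq_sums.get(key, 0.0) + param.grad.norm(2).item() ** 2
    return {key: total ** 0.5 for key, total in sq_sums.items()}
```

Logging these every few hundred steps should make any early-vs-late layer imbalance visible.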


This PR is trying to fix:
```
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 83, in <module>
    main()
  File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 22, in main
    ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp, args.target_pp)
  File...
```

**Please edit the OP** to add whatever fixes we applied to the core that need to be propagated upstream into:
1. https://github.com/microsoft/Megatron-DeepSpeed
2. https://github.com/NVIDIA/Megatron-LM

we want to do that...

Update released model files to include:
1. correct tokenizer files (t5-small or gpt2)
2. fill out `config.tokenizer_class` (see the sketch after this list)

HUB:
- [ ] https://huggingface.co/bigscience/gpt2-13b-en
- [ ] https://huggingface.co/bigscience/gpt2-1b3-en
- [ ]...
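A sketch of what filling out `config.tokenizer_class` could look like, assuming a local clone of one of the repos above (the directory name is a placeholder):

```python
from transformers import AutoConfig

repo_dir = "./gpt2-1b3-en"  # hypothetical local clone of e.g. bigscience/gpt2-1b3-en

config = AutoConfig.from_pretrained(repo_dir)
# "GPT2Tokenizer" for the gpt2-based models, "T5Tokenizer" for the t5-small-based ones
config.tokenizer_class = "GPT2Tokenizer"
config.save_pretrained(repo_dir)  # rewrites config.json, ready to commit/push to the hub
```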

I just started to look at how to adapt [zero_to_fp32.py](https://github.com/microsoft/DeepSpeed/blob/51a2e916b730cf676c66532b19d973a603377cb0/deepspeed/utils/zero_to_fp32.py) to extract fp32 weights from optimizer states. I will park this for now since it was said today that fp16 weights...
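For reference, a rough sketch of the direction, assuming a recent DeepSpeed where the same module exposes `get_fp32_state_dict_from_zero_checkpoint` (paths and tag below are placeholders):

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Reconstruct full fp32 weights from the partitioned ZeRO optimizer states.
checkpoint_dir = "path/to/checkpoints"  # placeholder
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag="global_step1000")

# Save them in a form the conversion tools can consume.
torch.save(state_dict, "pytorch_model_fp32.bin")
```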

Since some of us use deepspeed@master for other work, make sure we test against the correct DeepSpeed branch: deepspeed@big-science.

Trying to sort out how to run only the latest push and cancel the previous run if it's still in progress.

Let's discuss which data is used in the test suite, and after the discussion turn the outcome into guidelines for test writers. Here is a very rough start:

* We want to...

After a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss possible ready-to-use...
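Purely as an illustration of the kind of mitigation we could discuss (not something we run today; names and thresholds are made up), the cheapest guard is to skip optimizer updates whose loss jumps far above a running average:

```python
class SpikeGuard:
    """Skip parameter updates when the loss spikes above `factor` x running average."""

    def __init__(self, factor: float = 2.0, momentum: float = 0.99):
        self.factor = factor
        self.momentum = momentum
        self.avg = None

    def should_skip(self, loss: float) -> bool:
        if self.avg is None:
            self.avg = loss
            return False
        if loss > self.factor * self.avg:
            return True  # spike: caller should zero grads and skip this step
        # only fold "healthy" steps into the running average
        self.avg = self.momentum * self.avg + (1 - self.momentum) * loss
        return False
```

The training loop would call `guard.should_skip(loss.item())` right before `optimizer.step()`, zero the gradients and move on if it returns True.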

1. Currently it's hard to tell which datasets were used for the benchmark results posted here: https://huggingface.co/Helsinki-NLP/opus-mt-ru-en (and the other models under your account). After quite some digging I derived...