Stas Bekman

128 issues opened by Stas Bekman

The paper NormFormer: Improved Transformer Pretraining with Extra Normalization (https://arxiv.org/abs/2110.09456) suggests that under pre-LN: > gradients at earlier layers tend to be larger than gradients at later layers so we...
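If we want to check that gradient-imbalance claim on our own pre-LN runs, a minimal sketch (plain PyTorch; the helper name is hypothetical and `depth` is model-dependent) could look like this, called after `loss.backward()` and before `optimizer.step()`:

```python
import torch.nn as nn

def per_layer_grad_norms(model: nn.Module, depth: int = 3) -> dict:
    """Group parameter gradient L2 norms by name prefix, e.g. 'transformer.h.0'.

    Pick `depth` so the prefix ends at the layer index for the model in question.
    """
    sq_sums = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        key = ".".join(name.split(".")[:depth])
        sq_sums[key] = sq_sums.get(key, 0.0) + param.grad.norm(2).item() ** 2
    return {key: total ** 0.5 for key, total in sq_sums.items()}
```

Logging these every few hundred steps should make any early-vs-late layer imbalance visible.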


This PR is trying to fix:
```
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 83, in <module>
    main()
  File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 22, in main
    ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp, args.target_pp)
  File...
```

**Please edit the OP** to add whatever fixes we applied to the core that need to be propagated upstream into:
1. https://github.com/microsoft/Megatron-DeepSpeed
2. https://github.com/NVIDIA/Megatron-LM

we want to do that...

Update released model files to include:
1. correct tokenizer files (t5-small or gpt2)
2. fill out `config.tokenizer_class` (see the sketch after this list)

HUB:
- [ ] https://huggingface.co/bigscience/gpt2-13b-en
- [ ] https://huggingface.co/bigscience/gpt2-1b3-en
- [ ]...
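A sketch of what filling out `config.tokenizer_class` could look like, assuming a local clone of one of the repos above (the directory name is a placeholder):

```python
from transformers import AutoConfig

repo_dir = "./gpt2-1b3-en"  # hypothetical local clone of e.g. bigscience/gpt2-1b3-en

config = AutoConfig.from_pretrained(repo_dir)
# "GPT2Tokenizer" for the gpt2-based models, "T5Tokenizer" for the t5-small-based ones
config.tokenizer_class = "GPT2Tokenizer"
config.save_pretrained(repo_dir)  # rewrites config.json, ready to commit/push to the hub
```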

I just started to look at how to adapt [zero_to_fp32.py](https://github.com/microsoft/DeepSpeed/blob/51a2e916b730cf676c66532b19d973a603377cb0/deepspeed/utils/zero_to_fp32.py) to extract fp32 weights from optimizer states. I will park this for now since it was said today that fp16 weights...
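For reference, a rough sketch of the direction, assuming a recent DeepSpeed where the same module exposes `get_fp32_state_dict_from_zero_checkpoint` (paths and tag below are placeholders):

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Reconstruct full fp32 weights from the partitioned ZeRO optimizer states.
checkpoint_dir = "path/to/checkpoints"  # placeholder
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag="global_step1000")

# Save them in a form the conversion tools can consume.
torch.save(state_dict, "pytorch_model_fp32.bin")
```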

Since some of us use deepspeed@master for other work, make sure we test against the correct DeepSpeed branch: deepspeed@big-science.

Trying to sort out how to run only the latest push and cancel the previous run if it's still in progress.

Let's discuss which data is used in the test suite, and after the discussion turn the outcome into guidelines for test writers. Here is a very rough start:

* We want to...

After a 3->8->3 spike in the loss value a few days ago, which luckily recovered after a few hours of training, we want to discuss possible ready-to-use...
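Purely as an illustration of the kind of mitigation we could discuss (not something we run today; names and thresholds are made up), the cheapest guard is to skip optimizer updates whose loss jumps far above a running average:

```python
class SpikeGuard:
    """Skip parameter updates when the loss spikes above `factor` x running average."""

    def __init__(self, factor: float = 2.0, momentum: float = 0.99):
        self.factor = factor
        self.momentum = momentum
        self.avg = None

    def should_skip(self, loss: float) -> bool:
        if self.avg is None:
            self.avg = loss
            return False
        if loss > self.factor * self.avg:
            return True  # spike: caller should zero grads and skip this step
        # only fold "healthy" steps into the running average
        self.avg = self.momentum * self.avg + (1 - self.momentum) * loss
        return False
```

The training loop would call `guard.should_skip(loss.item())` right before `optimizer.step()`, zero the gradients and move on if it returns True.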

1. Currently it's hard to tell which datasets were used for the benchmark results posted here: https://huggingface.co/Helsinki-NLP/opus-mt-ru-en (and the other models under your account). After quite some digging I derived...