Stas Bekman
@Raibows, thank you for providing an easy-to-use repro - you can use `model_name = 'patrickvonplaten/t5-tiny-random'` while debugging this, as it'd be much faster and wouldn't require many resources...
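For reference, a minimal sketch of what swapping in the tiny model looks like (the surrounding training setup is whatever the repro already uses):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# tiny randomly-initialized T5 - loads in seconds and fits easily in memory,
# so the bug can be iterated on quickly before switching back to the real model
model_name = "patrickvonplaten/t5-tiny-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```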
Glad you figured it out, @Raibows! That's why we have unit tests - they help us know whether the feature is working correctly, and when it doesn't work for a user, often...
- The first few steps led to an OVERFLOW, so the optimizer didn't run and was thus fast. It then adjusted the scaling factor each step until it reached one that...
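For context, this is the standard dynamic loss-scaling behavior in fp16 training. A minimal sketch using PyTorch AMP (the function and its arguments are illustrative, not code from the actual issue):

```python
import torch

def train_fp16(model, optimizer, dataloader):
    # dynamic loss scaling: start with a large scale, shrink it on overflow,
    # and periodically grow it again once steps are stable
    scaler = torch.cuda.amp.GradScaler()
    for batch in dataloader:
        optimizer.zero_grad()
        with torch.autocast("cuda", dtype=torch.float16):
            loss = model(**batch).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # skipped internally if inf/nan grads were found (OVERFLOW)
        scaler.update()         # adjusts the scaling factor based on whether the step overflowed
```

The skipped `optimizer.step()` is why those early steps appear fast: only the forward/backward passes ran.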
Thank you for the great and easy-to-reproduce report, @fenchri. Indeed, you found a grad accumulation bug in HF Trainer. This is not a bug in DeepSpeed or its...
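For context, a sketch of the generic gradient accumulation pattern (not HF Trainer's actual code); a bug in this area typically means the loss normalization or the step boundary is off:

```python
def train_with_accumulation(model, optimizer, dataloader, accum_steps=4):
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss
        # divide so the accumulated gradient matches one big batch of
        # accum_steps * per_device_batch_size samples
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```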
Hmm, actually looking at earlier steps, this appears to be odd as well:

```
{'loss': 10.8252, 'learning_rate': 3e-05, 'epoch': 0.86}
 40%|████████████████████████████████████████▊ | 4/10 [00:00
```
OK, actually I came up with a fix; will push it shortly for you to try. Please try https://github.com/huggingface/transformers/pull/22098
> Thanks @stas00 for having a look and apologies for the late reply. Indeed, the fix resolves the issue! :tada:

Excellent! Thank you for testing the PR, @fenchri

> I...
Thank you for trying to analyse this, @moyix, and for wanting to make things faster. I dug into it and here is what I have to share with you. #...
I'm curious: are you doing inference or finetuning? For the latter, the init overhead is usually irrelevant. Fast loading is also important for debugging. I think I'm going...
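If load time is what matters (e.g. for inference or debugging), one knob worth knowing about is `low_cpu_mem_usage`, which skips the random weight init that `from_pretrained` otherwise performs before copying in the checkpoint. A sketch with a stand-in model name (recent `transformers` versions may require `accelerate` to be installed for this):

```python
from transformers import AutoModelForCausalLM

# avoids materializing randomly-initialized weights that would immediately be
# overwritten by the checkpoint, which is where much of the load time goes
model = AutoModelForCausalLM.from_pretrained("gpt2", low_cpu_mem_usage=True)
```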
Some additional solutions coming from the PyTorch Slack, where I asked [this question](https://pytorch.slack.com/archives/C3PDTEV8E/p1677813090248699):

1. Install pytorch-nightly following the instructions at https://pytorch.org/get-started/locally/ (or, if you read this later when pytorch==2.0 is released, any 2.0...
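Whichever build you install, you can confirm which one you ended up with (a trivial check, not specific to this thread):

```python
import torch

print(torch.__version__)  # a nightly reports a dev version string, e.g. "2.x.y.dev..."
```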