Zach Mueller
@winglian not quite yet! But I'll let you know for you to test :) (should be by end of this week!)
@winglian go ahead and try the branch out :) Note that it only works on a single GPU for now (will look at DeepSpeed tomorrow), and you shouldn't see a time...
Correct. I only tested on a tiny model just to get the API stable 😉
Now that it’s a bit more stable, I saw both memory decreases and speed increases when combining MS-AMP and TransformerEngine. More details are in the PR (so overall purely positives)
Correct, I'm looking into that this week
@alex-jw-brooks the idea behind this is indeed as you say :) Flag would be better, and do note that realistically `dispatch_batches` or `split_batches` shouldn't do *anything*, this is full user...
Please give us the full (very long) stack trace
@Ofir408 what is the output of `model.hf_device_map`?
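For anyone unsure what to paste here: `hf_device_map` is typically populated when a model is loaded with a `device_map`, and it maps module names to the device each one was placed on. A minimal sketch for summarizing it by device follows. The helper `summarize_device_map` and the example dictionary are hypothetical and only for illustration; with a real model you would pass `model.hf_device_map` instead.

```python
from collections import defaultdict


def summarize_device_map(device_map):
    """Group module names by the device they were placed on."""
    by_device = defaultdict(list)
    for module_name, device in device_map.items():
        # Devices can be GPU indices (int) or strings such as "cpu" or "disk",
        # so normalize them to strings for grouping.
        by_device[str(device)].append(module_name)
    return dict(by_device)


# Hypothetical map for illustration, not taken from any real model.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "lm_head": "disk",
}

summary = summarize_device_map(example_map)
for device, modules in summary.items():
    print(device, modules)
```

Modules that end up on `cpu` or `disk` are offloaded, which is usually the first thing worth checking when debugging placement issues.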
(Somewhat, currently trying to reverse engineer a few ways you did it, you guys would be *much* faster at it I imagine if you want to beat us to it...
Just as a fair warning, this will not be an immediate nor quick fix, since essentially this means every single model's calculation is off when doing `output.loss`, and every single...