Mitchell Wortsman

Results 88 comments of Mitchell Wortsman

Hmm. I really don't know. I guess souping + regression may be an open problem. Sorry about that.

Yep, looks like a similar error to what I'm seeing. I still haven't been able to resolve mine, so if anybody has any advice it would be much appreciated (https://github.com/openai/triton/issues/1512).

Thanks, yeah, you're probably right. I'm on torch 2.0.0+cu118 with triton 2.0.0. I'll try torch 1.13.1+cu117 and see if that works.

thanks, really appreciate it! i'll mess around with versions (probably later this week) and see if that fixes things

A useful test here could also just be a short training run with and without grad accum such that we'd expect the curves to be identical. If the model with...
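A minimal sketch of that kind of check, assuming a toy linear model with hand-written gradients rather than the actual training code (every name below is hypothetical). It runs the same full-batch update once in a single pass and once accumulated over micro-batches, then compares the two loss curves:

```python
import random


def make_data(n=32, seed=0):
    """Toy regression data: y = 3x + 0.5 plus a little noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def batch_grad(w, b, xs, ys):
    """Summed (not averaged) squared-error loss and gradients over a batch."""
    gw = gb = loss = 0.0
    for x, y in zip(xs, ys):
        err = w * x + b - y
        loss += err * err
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw, gb, loss


def train(xs, ys, accum_steps, steps=20, lr=0.1):
    """Gradient descent where each update accumulates `accum_steps` micro-batches."""
    n = len(xs)
    micro = n // accum_steps  # assumes n is divisible by accum_steps
    w = b = 0.0
    curve = []
    for _ in range(steps):
        gw = gb = loss = 0.0
        for i in range(accum_steps):
            part = slice(i * micro, (i + 1) * micro)
            dgw, dgb, dloss = batch_grad(w, b, xs[part], ys[part])
            gw += dgw
            gb += dgb
            loss += dloss
        # Normalize by the full batch size, not the micro-batch size,
        # so the update matches the run without accumulation.
        w -= lr * gw / n
        b -= lr * gb / n
        curve.append(loss / n)
    return curve


def max_curve_gap(a, b):
    """Largest pointwise difference between two loss curves."""
    return max(abs(p - q) for p, q in zip(a, b))


xs, ys = make_data()
curve_no_accum = train(xs, ys, accum_steps=1)
curve_accum = train(xs, ys, accum_steps=4)
gap = max_curve_gap(curve_no_accum, curve_accum)
```

With correct normalization the gap should only reflect floating-point summation order. A real test would swap in the actual model and data loader, and would need to account for anything that depends on batch composition.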

Agreed, thanks for raising this. This is in progress, but to provide some updates: - Added the following sentence to the readme: "In contrast with other repositories such as Megatron, we...

This is quite weird, thanks a lot for documenting that this is an issue. Just curious, does the behavior go away with `--grad-checkpointing`?

> @mitchellnw Let me check! To clarify what I'm looking for, I'd expect 8x / 12x / 14x batch sizes to fit for 11m vs 160m?

Yes totally. Sorry about...