Paul Maragakis
FWIW, commit 6c179fa from 2 days ago works fine with -DENABLE_BF16 -DMULTI_GPU (openmpi, nccl, no cuDNN) and gives about 225k tok/s for the gpt2-x model of TinyStories on a single...
In case it helps anyone else figure this out, the exact point that breaks multi-GPU training for me on an 8-GPU node in today's TOT is the first call to...
If `with_columns` were ever renamed, a reasonable choice would be `mutate`, which would bring it in line with the tidyverse in R.
This fix works on my end for the small/default gpt2 model (2.1M tok/s on 8 GPUs). The current code still breaks for the gpt2-xl model, albeit in a different way,...
I've verified that the fix by ademeure works on my end for GPT2-XL and opened a PR.
Thanks for this PR. I'm very curious to hear if you can share anything about the performance you get with this FSDP training loop in terms of tokens per second...