Zach Mueller
Confirmed with @stas00 that the load balancing looks proper; once tests pass, I'll merge ✅
I observed this as well when I was running some experiments (things were close post-fix, but not *exact*). Would you like to take a stab at a PR? :)
That is the issue with it, and why I'm not the biggest fan of that particular solution. We can't, because there are situations like `IterableDatasets` where that just cannot be...
Can confirm the fairseq solution works great; it'll be part of https://github.com/huggingface/transformers/pull/34283
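
For reference, a rough sketch of the fairseq-style normalization being referenced (illustrative only, not the code in the linked PR): sum the per-token loss over every micro-batch in the accumulation window and divide once by the total token count, instead of averaging each micro-batch separately. This assumes an HF-style model exposing `.logits` and labels padded with `-100`.

```python
import torch
import torch.nn.functional as F

def accumulate_loss(model, micro_batches, optimizer):
    # Count every non-padding token across the whole accumulation window.
    total_tokens = sum((mb["labels"] != -100).sum().item() for mb in micro_batches)
    optimizer.zero_grad()
    for mb in micro_batches:
        logits = model(mb["input_ids"]).logits
        # Per-token *sum*, not mean, so the later division is over the window.
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            mb["labels"].view(-1),
            ignore_index=-100,
            reduction="sum",
        )
        (loss / total_tokens).backward()
    optimizer.step()
```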
> This however does not make any impact as we scale (current fix or these ones)

This might be problem-specific; however, I did find the fix helped a little.
I'll leave this open for now. I didn't see significant discrepancies between DDP and non-DDP, but if users have stories/can show where it goes wrong, post them here for us...
What we can do then is add it under a flag that's disabled by default (`average_tokens_across_devices`) in `TrainingArguments`. @techkang want to take a stab at a PR?
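
From the user side, enabling the flag would look roughly like this; availability and the default depend on the `transformers` version that ships it.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    average_tokens_across_devices=True,  # off by default, as discussed above
)
```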
The main worry with FSDPv2 is whether it's stable enough that it makes sense to include it in Accelerate. In the worst case, we can keep a draft PR open...
What'd be helpful on my end is some bare-bones FSDP2 examples in PyTorch showing how things operate end-to-end
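
Something along these lines is a sketch of what such a bare-bones example could look like, assuming torch >= 2.4 where `fully_shard` is exposed under `torch.distributed._composable.fsdp` (the import path and defaults may shift between releases). Launch with `torchrun`, one process per GPU.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
    ).cuda()

    # Shard each parameterized layer, then the root, so every Linear
    # becomes its own FSDP parameter group.
    for layer in model:
        if isinstance(layer, nn.Linear):
            fully_shard(layer)
    fully_shard(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```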
Thanks @raghukiran1224 :) Yes indeed, I plan on looking into these with some of the torch folks. It's in the near future for us to get something small going. (Probably highly experimental,...