Zach Mueller


Confirmed with @stas00 the load balancing looks proper, once tests pass will merge ✅

I observed this as well when I was running some experiments (things were close post-fix, but not *exact*). Would you like to take a stab at a PR? :)

That is the issue with it, and why I'm not the biggest fan of that particular solution. We can't, because there are situations like `IterableDatasets` where that just cannot be...
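A minimal pure-Python sketch of the constraint mentioned above (the class below is a hypothetical stand-in, not Accelerate's actual code): a streaming-style dataset only defines `__iter__`, so any fix that needs the dataset's length up front simply cannot get one.

```python
class StreamingDataset:
    """Hypothetical stand-in for an IterableDataset: it yields examples
    from a stream and deliberately defines no __len__."""

    def __iter__(self):
        # In practice this would read from a file, socket, or generator.
        for i in range(100):
            yield i


ds = StreamingDataset()
try:
    n = len(ds)  # schemes that need the total size would do this up front
except TypeError:
    n = None  # length is unknowable until the stream is exhausted
```

Here `n` ends up `None`: the length is only discoverable by consuming the whole stream, which is exactly why length-based solutions can't cover this case.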

![W B Chart 10_22_2024, 10_20_21 AM](https://github.com/user-attachments/assets/814ee970-0617-48dd-a65b-0573c5740197) Can confirm the fairseq solution works great, it'll be part of https://github.com/huggingface/transformers/pull/34283

This, however, does not make any impact as we scale (with the current fix or these ones). ![image](https://github.com/user-attachments/assets/b99123ff-8887-43c7-a811-84e2c7474893) This might be problem-specific; however, I did find the fix helped a little.

I'll leave this open for now. I didn't see significant discrepancies between DDP and non-DDP, but if users have stories/can show where it goes wrong, post them here for us...

What we can do then is add it in under a flag which is disabled by default (`average_tokens_across_devices`) into the `TrainingArguments`. @techkang want to take a stab at a PR?
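To make the motivation for that flag concrete, here is a pure-Python sketch (the numbers are illustrative assumptions, not from the thread) of the difference between normalizing the loss per device versus across all devices, which is what a flag like `average_tokens_across_devices` would toggle:

```python
# Summed cross-entropy loss and non-padding token count on each device.
# Illustrative numbers: device 0 holds far more real tokens than device 1.
device_loss_sums = [8.0, 4.0]   # sum of per-token losses on each device
device_token_counts = [8, 2]    # non-padding tokens on each device

# Per-device normalization: each device divides by its OWN token count,
# then the per-device means are averaged. Devices with few tokens get
# disproportionate weight.
per_device_means = [s / t for s, t in zip(device_loss_sums, device_token_counts)]
naive_loss = sum(per_device_means) / len(per_device_means)  # (1.0 + 2.0) / 2 = 1.5

# Global normalization: divide the global loss sum by the global token
# count (in practice this needs an all-reduce of the token counts).
global_loss = sum(device_loss_sums) / sum(device_token_counts)  # 12.0 / 10 = 1.2
```

The two values only agree when every device holds the same number of real tokens, which is why keeping the flag off by default is safe while still letting affected users opt in.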

The main worry with FSDPv2 is whether it's stable enough that it makes sense to include it in Accelerate. In the worst case, we can keep a draft PR open...

What'd be helpful on my end is some bare-bones FSDP2 examples in PyTorch showing how things operate end-to-end.

Thanks @raghukiran1224 :) Yes indeed, I plan on looking into these w/ some of the torch folks. It's in our near future to get something small going. (Probably highly experimental,...