Andrew Gu
Can this PR be landed?
I am guessing this is asking for normal microbatching. There are similar APIs for FSDP2 that can control communication during gradient accumulation. We migrated the `no_sync()` context to directly just...
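For reference, here is a minimal sketch of controlling gradient reduction during accumulation with FSDP2's `fully_shard`, assuming `set_requires_gradient_sync()` is the replacement for the `no_sync()` context. `build_model()` and `data_loader()` are placeholder helpers, the process group is assumed to already be initialized (e.g. launched via torchrun), and the `fully_shard` import path depends on the PyTorch version:
```
import torch
import torch.nn.functional as F
# On older versions this lives under torch.distributed._composable.fsdp
from torch.distributed.fsdp import fully_shard

model = build_model().cuda()        # placeholder model constructor
fully_shard(model)                  # shards parameters in place
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

accum_steps = 4
for step, (inp, target) in enumerate(data_loader()):   # placeholder loader
    last_microbatch = (step + 1) % accum_steps == 0
    # Only reduce-scatter gradients on the last microbatch of the window.
    model.set_requires_gradient_sync(last_microbatch)
    loss = F.cross_entropy(model(inp.cuda()), target.cuda())
    loss.backward()
    if last_microbatch:
        optim.step()
        optim.zero_grad()
```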
@XinDongol I think that is sufficient. If you want to avoid reduce-scatter in backward, then what you have is right. Note however that this will mean that gradients are left...
@pytorchbot merge
@pytorchbot rebase -s
@pytorchbot rebase -s
@pytorchbot merge
This PR https://github.com/pytorch/pytorch/pull/91795 looks good to me as a fix for the first error, but I will let @wanchaol or @fduwjj approve.
It is not necessary to move the model to the GPU before passing it to FSDP: ``` model = Net().to(rank) ``` You only need to make sure that the model after applying FSDP...
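A minimal sketch of letting FSDP place the CPU-constructed module on the GPU via the `device_id` constructor argument instead of calling `.to(rank)` first; `Net` is the placeholder model from above:
```
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = Net()  # constructed on CPU
# FSDP moves the module to the given device and shards its parameters there.
model = FSDP(model, device_id=torch.cuda.current_device())
```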
FSDP has some support for deferred initialization if you look at the `param_init_fn` constructor argument, which would allow initializing models that exceed the capacity of CPU DRAM. However, the current support is not...
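A minimal sketch of the deferred-initialization pattern, assuming the model can be constructed on the meta device and that `param_init_fn` is a per-module callable that materializes and reinitializes parameters on each rank (the `to_empty` / `reset_parameters` pattern follows the FSDP docs; `Net` is again a placeholder):
```
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

with torch.device("meta"):
    model = Net()  # parameters allocate no real storage on the meta device

def init_fn(module: nn.Module):
    # Materialize this module's own meta parameters on the GPU, then reinitialize.
    module.to_empty(device=torch.cuda.current_device(), recurse=False)
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()

model = FSDP(model, param_init_fn=init_fn, device_id=torch.cuda.current_device())
```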