Andrew Gu

159 comments by Andrew Gu

I am guessing this is asking for normal microbatching. FSDP2 has similar APIs that can control communication during gradient accumulation. We migrated the `no_sync()` context to directly just...
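
A minimal sketch of that FSDP2 flow, assuming `fully_shard` and its `set_requires_gradient_sync()` method (the import path varies by PyTorch version); the model, optimizer, and microbatch tensors below are placeholders, not code from the thread:

```
# Gradient accumulation with FSDP2: skip the gradient reduce-scatter on all
# but the last microbatch. Assumes a process group is already initialized
# (e.g. via torchrun) and that CUDA is available.
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # torch.distributed._composable.fsdp on older versions

model = nn.Linear(16, 16, device="cuda")
fully_shard(model)  # FSDP2: shards parameters in place
optim = torch.optim.AdamW(model.parameters(), lr=1e-3)

microbatches = [torch.randn(8, 16, device="cuda") for _ in range(4)]
for i, batch in enumerate(microbatches):
    is_last_microbatch = i == len(microbatches) - 1
    # Only reduce-scatter gradients on the last microbatch of the window;
    # earlier backwards accumulate gradients locally.
    model.set_requires_gradient_sync(is_last_microbatch)
    model(batch).sum().backward()
optim.step()
optim.zero_grad()
```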

@XinDongol I think that is sufficient. If you want to avoid the reduce-scatter in backward, then what you have is right. Note, however, that this will mean that gradients are left...
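
The exact code under discussion is not shown here, so purely as an illustration of the trade-off: with FSDP1, skipping the reduce-scatter across microbatches is typically done with the `no_sync()` context manager, at the cost of holding unsharded gradients on each rank (presumably the caveat the truncated sentence above refers to):

```
# Illustration only: FSDP1 gradient accumulation without reduce-scatter.
# Assumes a process group is already initialized and CUDA is available.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(nn.Linear(16, 16).cuda())
optim = torch.optim.SGD(model.parameters(), lr=1e-2)

microbatches = [torch.randn(8, 16, device="cuda") for _ in range(4)]
with model.no_sync():
    # No reduce-scatter here; each rank accumulates full, unsharded gradients,
    # which uses extra memory proportional to the unsharded gradient size.
    for batch in microbatches[:-1]:
        model(batch).sum().backward()
# Final microbatch outside no_sync(): reduce-scatter runs and leaves sharded
# gradients for the optimizer step.
model(microbatches[-1]).sum().backward()
optim.step()
optim.zero_grad()
```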

This PR https://github.com/pytorch/pytorch/pull/91795 looks good to me to fix the first error, but I will let @wanchaol or @fduwjj approve.

It is not necessary to move the model to GPU before passing to FSDP:

```
model = Net().to(rank)
```

You only need to make sure the model after applying FSDP...
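
As a sketch of the alternative (not verbatim from the thread), assuming FSDP1's `device_id` constructor argument: the model can stay on CPU, and FSDP moves it onto the given GPU while wrapping:

```
# Sketch: build the model on CPU and let FSDP handle the GPU placement via
# device_id instead of calling .to(rank) beforehand. Assumes a process group
# is already initialized; Net is a stand-in for the model in the comment.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 16)

    def forward(self, x):
        return self.fc(x)

model = Net()  # stays on CPU
fsdp_model = FSDP(model, device_id=torch.cuda.current_device())
```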

FSDP has some support for deferred initialization via the `param_init_fn` constructor argument, which would allow the model to exceed CPU DRAM capacity. However, the current support is not...
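
A hedged sketch of the meta-device pattern that `param_init_fn` targets; the materialization details (e.g. relying on `reset_parameters()` and `to_empty(recurse=False)`) depend on the model and PyTorch version, so treat this as illustrative rather than a recommended recipe:

```
# Illustrative only: construct the model on the meta device (no real memory
# allocated), then have FSDP materialize and initialize parameters on GPU via
# param_init_fn. Assumes a process group is already initialized.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

with torch.device("meta"):
    model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

def param_init_fn(module: nn.Module) -> None:
    # Materialize this module's own parameters on GPU, then run its
    # initializer if it defines one (nn.Linear does).
    module.to_empty(device=torch.device("cuda"), recurse=False)
    if callable(getattr(module, "reset_parameters", None)):
        module.reset_parameters()

fsdp_model = FSDP(model, param_init_fn=param_init_fn)
```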