Andrew Gu

159 comments by Andrew Gu

I am guessing this is asking for normal microbatching. FSDP2 has similar APIs that can control communication during gradient accumulation. We migrated the `no_sync()` context to directly just...
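
A minimal sketch of that FSDP2 flow, assuming `fully_shard` and its `set_requires_gradient_sync()` method (the import path varies by PyTorch version); the model, optimizer, and microbatch tensors below are placeholders, not code from the thread:

```
# Gradient accumulation with FSDP2: skip the gradient reduce-scatter on all
# but the last microbatch. Assumes a process group is already initialized
# (e.g. via torchrun) and that CUDA is available.
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # torch.distributed._composable.fsdp on older versions

model = nn.Linear(16, 16, device="cuda")
fully_shard(model)  # FSDP2: shards parameters in place
optim = torch.optim.AdamW(model.parameters(), lr=1e-3)

microbatches = [torch.randn(8, 16, device="cuda") for _ in range(4)]
for i, batch in enumerate(microbatches):
    is_last_microbatch = i == len(microbatches) - 1
    # Only reduce-scatter gradients on the last microbatch of the window;
    # earlier backwards accumulate gradients locally.
    model.set_requires_gradient_sync(is_last_microbatch)
    model(batch).sum().backward()
optim.step()
optim.zero_grad()
```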

@XinDongol I think that is sufficient. If you want to avoid the reduce-scatter in backward, then what you have is right. Note, however, that this will mean that gradients are left...
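
The exact code under discussion is not shown here, so purely as an illustration of the trade-off: with FSDP1, skipping the reduce-scatter across microbatches is typically done with the `no_sync()` context manager, at the cost of holding unsharded gradients on each rank (presumably the caveat the truncated sentence above refers to):

```
# Illustration only: FSDP1 gradient accumulation without reduce-scatter.
# Assumes a process group is already initialized and CUDA is available.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(nn.Linear(16, 16).cuda())
optim = torch.optim.SGD(model.parameters(), lr=1e-2)

microbatches = [torch.randn(8, 16, device="cuda") for _ in range(4)]
with model.no_sync():
    # No reduce-scatter here; each rank accumulates full, unsharded gradients,
    # which uses extra memory proportional to the unsharded gradient size.
    for batch in microbatches[:-1]:
        model(batch).sum().backward()
# Final microbatch outside no_sync(): reduce-scatter runs and leaves sharded
# gradients for the optimizer step.
model(microbatches[-1]).sum().backward()
optim.step()
optim.zero_grad()
```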

This PR https://github.com/pytorch/pytorch/pull/91795 looks good to me to fix the first error, but I will let @wanchaol or @fduwjj approve.

It is not necessary to move the model to GPU before passing to FSDP:

```
model = Net().to(rank)
```

You only need to make sure the model after applying FSDP...
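
As a sketch of the alternative (not verbatim from the thread), assuming FSDP1's `device_id` constructor argument: the model can stay on CPU, and FSDP moves it onto the given GPU while wrapping:

```
# Sketch: build the model on CPU and let FSDP handle the GPU placement via
# device_id instead of calling .to(rank) beforehand. Assumes a process group
# is already initialized; Net is a stand-in for the model in the comment.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 16)

    def forward(self, x):
        return self.fc(x)

model = Net()  # stays on CPU
fsdp_model = FSDP(model, device_id=torch.cuda.current_device())
```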

FSDP has some support for deferred initialization via the `param_init_fn` constructor argument, which would allow the model to exceed CPU DRAM capacity. However, the current support is not...
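
A hedged sketch of the meta-device pattern that `param_init_fn` targets; the materialization details (e.g. relying on `reset_parameters()` and `to_empty(recurse=False)`) depend on the model and PyTorch version, so treat this as illustrative rather than a recommended recipe:

```
# Illustrative only: construct the model on the meta device (no real memory
# allocated), then have FSDP materialize and initialize parameters on GPU via
# param_init_fn. Assumes a process group is already initialized.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

with torch.device("meta"):
    model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

def param_init_fn(module: nn.Module) -> None:
    # Materialize this module's own parameters on GPU, then run its
    # initializer if it defines one (nn.Linear does).
    module.to_empty(device=torch.device("cuda"), recurse=False)
    if callable(getattr(module, "reset_parameters", None)):
        module.reset_parameters()

fsdp_model = FSDP(model, param_init_fn=param_init_fn)
```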