Andrew Gu
> Is there a way to accumulate the gradient by keeping a running sum, and just do loss.backward() after finishing all the microbatches? What is the advantage of doing this?...
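For context, a minimal single-process sketch (no FSDP involved) of the usual microbatch pattern: calling `loss.backward()` once per microbatch already keeps a running sum in each parameter's `.grad`, so no separate accumulation step is needed before the optimizer step. The model and data below are placeholders for illustration only.

```python
import torch
import torch.nn as nn

# Toy model and microbatches (placeholders for illustration).
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
microbatches = [torch.randn(4, 8) for _ in range(4)]

opt.zero_grad()
for x in microbatches:
    # Scale each microbatch loss so the accumulated gradient matches the full-batch average.
    loss = model(x).pow(2).mean() / len(microbatches)
    loss.backward()  # gradients are summed into param.grad across microbatches
opt.step()
```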
Each data parallel worker will compute unsharded gradients (i.e. gradients with the original shape / no FSDP sharding) for its local batch in the backward pass. During that backward pass,...
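To make the reduce-scatter step concrete, here is a tiny sketch of its semantics (not FSDP's implementation): each rank starts with a full, unsharded gradient for its local batch, the gradients are summed across ranks, and each rank keeps only its `1/world_size` shard of the sum. The sketch uses an all-reduce plus a local slice, which gives the same result as a reduce-scatter, and assumes a `torchrun` launch with the `gloo` backend.

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # e.g. torchrun --nproc_per_node=2 this_script.py
rank, world = dist.get_rank(), dist.get_world_size()

# Stand-in for this rank's full, unsharded gradient for its local batch.
unsharded_grad = torch.full((4 * world,), float(rank + 1))

# Reduce-scatter semantics: sum across ranks, then keep only this rank's shard.
summed = unsharded_grad.clone()
dist.all_reduce(summed, op=dist.ReduceOp.SUM)
grad_shard = summed[rank * 4 : (rank + 1) * 4]
print(f"rank {rank} shard: {grad_shard.tolist()}")

dist.destroy_process_group()
```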
FSDP is already doing that :) How parameters/gradients are grouped together is determined by how you call the API on modules. The common approach is to group each transformer block...
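As an illustration of the per-transformer-block grouping, here is a hedged sketch using the FSDP wrapper API with `ModuleWrapPolicy`; the `TransformerBlock` class is a toy stand-in, and an initialized process group (e.g. via `torchrun` with NCCL) is assumed.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

model = nn.Sequential(*[TransformerBlock() for _ in range(4)])

# Each TransformerBlock becomes one FSDP unit: its parameters are flattened and
# sharded together, so each block gets one all-gather in forward/backward and
# one reduce-scatter for its gradients.
model = FSDP(
    model,
    auto_wrap_policy=ModuleWrapPolicy({TransformerBlock}),
    use_orig_params=True,
)
```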
No compiler — it is eager mode. You can take a look here: https://github.com/pytorch/torchtitan/blob/1923ce4db3018a69d2463a6efd7e1ae44cb02ec6/torchtitan/parallelisms/parallelize_llama.py#L289
From [`mosaicml/composer`](https://github.com/mosaicml/composer/blob/a40b8ba988e82990cfddb2696e5db7c8890cab8a/composer/models/huggingface.py#L679-L688):

```
# Note: We need to use the FSDP.summon_full_params context manager here because the generate function
# does not seem to gather the weights for the LM head....
```
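For reference, a hedged sketch of the pattern that comment describes — gathering the full, unsharded parameters before calling a method such as HuggingFace's `generate()`, which does not go through FSDP's forward. `fsdp_model` and `input_ids` are placeholders, not names from the linked code.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Unshard all parameters for the duration of the block; writeback=False since
# generation does not modify them.
with FSDP.summon_full_params(fsdp_model, writeback=False):
    output_ids = fsdp_model.module.generate(input_ids, max_new_tokens=32)
```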
> is this making an assumption that the user-defined fwd, e.g. `forward_features`, won't call hooks from `nn.Module`? Otherwise we will have `fsdp_hook(forward_features(fsdp_hook))`

Since the user-defined forward method (e.g. `forward_features`) is...
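A small single-process illustration of why this matters: forward hooks registered on an `nn.Module` only fire through `__call__` (i.e. `module(x)`); calling a custom method such as `forward_features` directly bypasses them. The toy `Encoder` below is an assumption for illustration, not FSDP code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    def forward(self, x):
        return self.forward_features(x)

    def forward_features(self, x):
        return self.proj(x)

m = Encoder()
m.register_forward_pre_hook(lambda mod, args: print("pre-forward hook ran"))

x = torch.randn(2, 4)
m(x)                   # prints "pre-forward hook ran"
m.forward_features(x)  # hook does NOT run; __call__ was bypassed
```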
> `forward_features` is under the user's control? I guess we are ignoring the chance that the user calls `nn.Module` forward hooks explicitly in `forward_features`?

That is a good point. We are assuming that...
> > user-defined method for that one particular module. Any nested submodules will run `forward` normally
>
> is the particular module mostly the root module? like `model.generate()`?

Yes. I...
@pytorchbot merge
@gaotianyu1350 Sorry, this does not apply to `torch.distributed.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel`. The current workaround for that is to use `summon_full_params(recurse=False)`.
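A hedged sketch of that workaround, with `fsdp_model` standing in for the root `FullyShardedDataParallel` instance; `recurse=False` gathers only the parameters managed directly by the root FSDP instance, not those of nested FSDP-wrapped submodules.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

with FSDP.summon_full_params(fsdp_model, recurse=False):
    ...  # the root module's directly managed parameters are unsharded here
```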