Andrew Gu
> Is there a way to accumulate the gradient by keeping a running sum, and just do loss.backward() after finishing all the microbatches? What is the advantage of doing this?...
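For context, a minimal single-process sketch (no FSDP involved) of the usual microbatch pattern: calling `loss.backward()` once per microbatch already keeps a running sum in each parameter's `.grad`, so no separate accumulation step is needed before the optimizer step. The model and data below are placeholders for illustration only.

```python
import torch
import torch.nn as nn

# Toy model and microbatches (placeholders for illustration).
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
microbatches = [torch.randn(4, 8) for _ in range(4)]

opt.zero_grad()
for x in microbatches:
    # Scale each microbatch loss so the accumulated gradient matches the full-batch average.
    loss = model(x).pow(2).mean() / len(microbatches)
    loss.backward()  # gradients are summed into param.grad across microbatches
opt.step()
```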
Each data parallel worker will compute unsharded gradients (i.e. gradients with the original shape / no FSDP sharding) for its local batch in the backward pass. During that backward pass,...
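To make the reduce-scatter step concrete, here is a tiny sketch of its semantics (not FSDP's implementation): each rank starts with a full, unsharded gradient for its local batch, the gradients are summed across ranks, and each rank keeps only its `1/world_size` shard of the sum. The sketch uses an all-reduce plus a local slice, which gives the same result as a reduce-scatter, and assumes a `torchrun` launch with the `gloo` backend.

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # e.g. torchrun --nproc_per_node=2 this_script.py
rank, world = dist.get_rank(), dist.get_world_size()

# Stand-in for this rank's full, unsharded gradient for its local batch.
unsharded_grad = torch.full((4 * world,), float(rank + 1))

# Reduce-scatter semantics: sum across ranks, then keep only this rank's shard.
summed = unsharded_grad.clone()
dist.all_reduce(summed, op=dist.ReduceOp.SUM)
grad_shard = summed[rank * 4 : (rank + 1) * 4]
print(f"rank {rank} shard: {grad_shard.tolist()}")

dist.destroy_process_group()
```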
FSDP is already doing that :) How parameters/gradients are grouped together is determined by how you call the API on modules. The common approach is to group each transformer block...
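As an illustration of the per-transformer-block grouping, here is a hedged sketch using the FSDP wrapper API with `ModuleWrapPolicy`; the `TransformerBlock` class is a toy stand-in, and an initialized process group (e.g. via `torchrun` with NCCL) is assumed.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

model = nn.Sequential(*[TransformerBlock() for _ in range(4)])

# Each TransformerBlock becomes one FSDP unit: its parameters are flattened and
# sharded together, so each block gets one all-gather in forward/backward and
# one reduce-scatter for its gradients.
model = FSDP(
    model,
    auto_wrap_policy=ModuleWrapPolicy({TransformerBlock}),
    use_orig_params=True,
)
```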
No compiler — it is eager mode. You can take a look here: https://github.com/pytorch/torchtitan/blob/1923ce4db3018a69d2463a6efd7e1ae44cb02ec6/torchtitan/parallelisms/parallelize_llama.py#L289
From [`mosaicml/composer`](https://github.com/mosaicml/composer/blob/a40b8ba988e82990cfddb2696e5db7c8890cab8a/composer/models/huggingface.py#L679-L688):

```
# Note: We need to use the FSDP.summon_full_params context manager here because the generate function
# does not seem to gather the weights for the LM head....
```
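For reference, a hedged sketch of the pattern that comment describes — gathering the full, unsharded parameters before calling a method such as HuggingFace's `generate()`, which does not go through FSDP's forward. `fsdp_model` and `input_ids` are placeholders, not names from the linked code.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Unshard all parameters for the duration of the block; writeback=False since
# generation does not modify them.
with FSDP.summon_full_params(fsdp_model, writeback=False):
    output_ids = fsdp_model.module.generate(input_ids, max_new_tokens=32)
```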
> is this making an assumption that the user-defined fwd, e.g. `forward_features`, won't call hooks from `nn.Module`? Otherwise we will have `fsdp_hook(forward_features(fsdp_hook))`

Since the user-defined forward method (e.g. `forward_features`) is...
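A small single-process illustration of why this matters: forward hooks registered on an `nn.Module` only fire through `__call__` (i.e. `module(x)`); calling a custom method such as `forward_features` directly bypasses them. The toy `Encoder` below is an assumption for illustration, not FSDP code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    def forward(self, x):
        return self.forward_features(x)

    def forward_features(self, x):
        return self.proj(x)

m = Encoder()
m.register_forward_pre_hook(lambda mod, args: print("pre-forward hook ran"))

x = torch.randn(2, 4)
m(x)                   # prints "pre-forward hook ran"
m.forward_features(x)  # hook does NOT run; __call__ was bypassed
```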
> `forward_features` is under the user's control? I guess we are ignoring the chance that the user calls `nn.Module` forward hooks explicitly in `forward_features`?

That is a good point. We are assuming that...
> > user-defined method for that one particular module. Any nested submodules will run `forward` normally
>
> is the particular module mostly the root module? like `model.generate()`?

Yes. I...
@pytorchbot merge
@gaotianyu1350 Sorry, this does not apply to `torch.distributed.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel`. The current workaround for that is to use `summon_full_params(recurse=False)`.
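A hedged sketch of that workaround, with `fsdp_model` standing in for the root `FullyShardedDataParallel` instance; `recurse=False` gathers only the parameters managed directly by the root FSDP instance, not those of nested FSDP-wrapped submodules.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

with FSDP.summon_full_params(fsdp_model, recurse=False):
    ...  # the root module's directly managed parameters are unsharded here
```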