Andrew Gu

159 comments by Andrew Gu

@kygguo Are you using `reshard_after_forward=True` / `FULL_SHARD`?

Could you share the args you are passing to FSDP / how you are calling FSDP?

Can you try passing `sharding_strategy=ShardingStrategy.SHARD_GRAD_OP`?

1. `register_fsdp_forward_method` only works with FSDP2, but you are using FSDP1 :/ Either way, if you are doing `SHARD_GRAD_OP` already, I think it will not help. What you are doing...

@yuxin212 Did you already make sure that shared parameters are either wrapped in the same FSDP module or in parent-child FSDP modules (i.e. _not_ in different FSDP modules that are...

I have not had the chance to try to run your script, but if you want to check for shared parameters, you can run this before you wrap with FSDP:...
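The original snippet is cut off above; a minimal sketch of such a shared-parameter check (my reconstruction, not the original code — `find_shared_parameters` and the `Tied` demo model are illustrative names) could look like:

```python
import torch.nn as nn

def find_shared_parameters(model: nn.Module):
    """Return (first_name, duplicate_name) pairs for parameters that are shared."""
    seen = {}     # id(param) -> first name it appeared under
    shared = []
    # remove_duplicate=False makes named_parameters() yield tied params again
    for name, param in model.named_parameters(remove_duplicate=False):
        if id(param) in seen:
            shared.append((seen[id(param)], name))
        else:
            seen[id(param)] = name
    return shared

# Demo: a model with tied embedding / output weights.
class Tied(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.head = nn.Linear(4, 10, bias=False)
        self.head.weight = self.emb.weight  # weight tying

print(find_shared_parameters(Tied()))  # -> [('emb.weight', 'head.weight')]
```

Any pair it reports must end up in the same FSDP module or in parent-child FSDP modules.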

Or actually, taking another look, I have another suspicion. The issue might be coming from `model.module.get_embeds(q_input_ids)`. FSDP only knows to all-gather the parameters when you run the FSDP...

@yuxin212 If you do not wrap the whole model, I think you may not have any overlap of communication/computation 🤔 In particular, we need a parent module of the transformer...

yes :( Since `ModuleList` does not implement `forward`, we should actually never wrap a `ModuleList` with FSDP, since then it would never run the all-gather

> Does this issue affect performance or efficiency?

Wrapping `nn.ModuleList` affects correctness :/ Since parameters are not all-gathered, you would see an error (probably a shape mismatch).

> If I...