Andrew Gu

159 comments by Andrew Gu

> Test Plan: Unit tested checkpoint_wrapper.py by instantiating ActivationWrapper and got TypeError as expected.

I am okay with the PR, but the test plan in the description does not match...
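For context, a minimal sketch of the kind of check that test plan describes, assuming `ActivationWrapper` still lives under `torch.distributed.algorithms._checkpoint.checkpoint_wrapper` and that direct instantiation is what raises the `TypeError` (the test name and assertion style below are my own, not the PR's actual test):

```python
# Sketch only: assumes ActivationWrapper is abstract, so direct construction fails.
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import ActivationWrapper


def test_activation_wrapper_cannot_be_instantiated():
    try:
        ActivationWrapper(nn.Linear(4, 4))
    except TypeError:
        return  # expected: ActivationWrapper should not be instantiated directly
    raise AssertionError("expected TypeError when instantiating ActivationWrapper")
```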

Just so we do not forget: we need to experiment with/evaluate whether it is okay that, if we run `unshard(async_op=True)` today, the all-gather copy-in ops will run in...
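For reference, a minimal sketch of how explicit `unshard(async_op=True)` prefetching might be driven from user code, assuming `blocks` are submodules already wrapped with `fully_shard` and that the returned handle exposes a `wait()` method:

```python
# Sketch of manual layer-by-layer prefetching with FSDP2's async unshard.
def forward_with_manual_prefetch(blocks, x):
    # Kick off the all-gather (copy-in + collective) for the first block.
    handle = blocks[0].unshard(async_op=True)
    for i, block in enumerate(blocks):
        if handle is not None:
            handle.wait()  # ensure block i's parameters are unsharded
        # Start all-gathering the next block while computing the current one.
        handle = blocks[i + 1].unshard(async_op=True) if i + 1 < len(blocks) else None
        x = block(x)
    return x
```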

Hey @zhouzaida! Sorry, I am going to be out for the next week. Hopefully, my team's next oncall can get to this.

Nice work! I will take a closer look at this when I get a chance.

FSDP2 does pad tensors on the sharded dim. For your original error, I am not sure where it is coming from. It would be helpful to show more of the...
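As a toy illustration of what padding on the sharded dim means (this mirrors my understanding of the behavior and is not the actual FSDP2 sharding code), padding dim-0 up to a multiple of the world size gives every rank an equal chunk:

```python
# Toy illustration (not FSDP2 internals): pad dim-0 to a multiple of world_size
# so that every rank receives an equally sized shard.
import torch

world_size = 4
weight = torch.randn(10, 8)            # dim-0 (10) is not divisible by 4

padded_dim0 = -(-weight.shape[0] // world_size) * world_size   # ceil -> 12
pad = weight.new_zeros(padded_dim0 - weight.shape[0], weight.shape[1])
shards = torch.cat([weight, pad]).chunk(world_size, dim=0)

print([tuple(s.shape) for s in shards])  # [(3, 8), (3, 8), (3, 8), (3, 8)]
```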

Yes, I think the fp8 all-gather currently assumes that dim-0 is divisible by the world size.
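One way to see whether a model would hit this assumption is to audit parameter shapes up front; `find_bad_dim0_params` below is a hypothetical helper, not a PyTorch or torchao API:

```python
# Hypothetical helper: report parameters whose dim-0 is not divisible by the
# data-parallel world size (the assumption the fp8 all-gather relies on).
import torch.nn as nn


def find_bad_dim0_params(model: nn.Module, world_size: int):
    bad = []
    for name, param in model.named_parameters():
        if param.dim() > 0 and param.shape[0] % world_size != 0:
            bad.append((name, tuple(param.shape)))
    return bad


# Example: a vocab size of 50257 is not divisible by 8, so the embedding and
# output projection (weight and bias) are flagged.
model = nn.Sequential(nn.Embedding(50257, 512), nn.Linear(512, 50257))
print(find_bad_dim0_params(model, world_size=8))
```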

Could you provide some more details, like an example trace and a minimal repro of the code? (Otherwise, it is hard for us to understand and help with your issue.)

For FSDP, there are two ways to accumulate gradients:

1. Accumulate unsharded gradients (`model.no_sync` context in FSDP1, `model.set_requires_gradient_sync(is_last_microbatch)` in FSDP2; see the sketch below)
2. Accumulate sharded gradients

We should differentiate between training throughput...
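A minimal sketch of option 1, assuming `model` is wrapped with FSDP1's `FullyShardedDataParallel` or FSDP2's `fully_shard`, and with `microbatches`, `loss_fn`, and `optimizer` as placeholders:

```python
# Sketch of option 1: accumulate unsharded gradients across microbatches and
# only communicate (reduce) gradients on the last one.
import contextlib


def accumulate_unsharded(model, microbatches, loss_fn, optimizer, use_fsdp2: bool):
    for i, batch in enumerate(microbatches):
        is_last_microbatch = i == len(microbatches) - 1
        if use_fsdp2:
            # FSDP2: skip gradient reduction for all but the last microbatch.
            model.set_requires_gradient_sync(is_last_microbatch)
            ctx = contextlib.nullcontext()
        else:
            # FSDP1: no_sync() keeps gradients unsharded and skips communication;
            # run the final backward outside of it so gradients get reduced.
            ctx = contextlib.nullcontext() if is_last_microbatch else model.no_sync()
        with ctx:
            loss_fn(model(batch)).backward()
    optimizer.step()
    optimizer.zero_grad()
```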

I think in FSDP1, gradients are accumulated in the dtype in which they were computed (e.g. bf16). In FSDP2, if you specify `MixedPrecisionPolicy(reduce_dtype=torch.float32)`, then it will have extra logic...
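A minimal sketch of that FSDP2 configuration; the import path is an assumption (it has also lived under `torch.distributed._composable.fsdp` in some releases), and this assumes `torch.distributed` and a device mesh are already initialized (e.g. via `torchrun`):

```python
# Sketch: bf16 compute with fp32 gradient reduction in FSDP2.
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,    # all-gather/compute parameters in bf16
    reduce_dtype=torch.float32,    # upcast gradients to fp32 for reduce-scatter
)

model = nn.Transformer()
for layer in model.encoder.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)
```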