Andrew Gu

159 comments by Andrew Gu

> Test Plan: Unit tested checkpoint_wrapper.py by instantiating ActivationWrapper and got TypeError as expected.

I am okay with the PR, but the test plan in the description does not match...
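For context, a minimal sketch of the kind of check that test plan describes, assuming `ActivationWrapper` still lives under `torch.distributed.algorithms._checkpoint.checkpoint_wrapper` and that direct instantiation is what raises the `TypeError` (the test name and assertion style below are my own, not the PR's actual test):

```python
# Sketch only: assumes ActivationWrapper is abstract, so direct construction fails.
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import ActivationWrapper


def test_activation_wrapper_cannot_be_instantiated():
    try:
        ActivationWrapper(nn.Linear(4, 4))
    except TypeError:
        return  # expected: ActivationWrapper should not be instantiated directly
    raise AssertionError("expected TypeError when instantiating ActivationWrapper")
```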

Just so we do not forget: we need to experiment with/evaluate whether it is okay that, if we run `unshard(async_op=True)` today, the all-gather copy-in ops will run in...
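For reference, a minimal sketch of how explicit `unshard(async_op=True)` prefetching might be driven from user code, assuming `blocks` are submodules already wrapped with `fully_shard` and that the returned handle exposes a `wait()` method:

```python
# Sketch of manual layer-by-layer prefetching with FSDP2's async unshard.
def forward_with_manual_prefetch(blocks, x):
    # Kick off the all-gather (copy-in + collective) for the first block.
    handle = blocks[0].unshard(async_op=True)
    for i, block in enumerate(blocks):
        if handle is not None:
            handle.wait()  # ensure block i's parameters are unsharded
        # Start all-gathering the next block while computing the current one.
        handle = blocks[i + 1].unshard(async_op=True) if i + 1 < len(blocks) else None
        x = block(x)
    return x
```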

Hey @zhouzaida! Sorry, I am going to be out for the next week. Hopefully, my team's next oncall can get to this.

Nice work! I will take a closer look at this when I get a chance.

FSDP2 does pad tensors on the sharded dim. For your original error, I am not sure where it is coming from. It would be helpful to show more of the...
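As a toy illustration of what padding on the sharded dim means (this mirrors my understanding of the behavior and is not the actual FSDP2 sharding code), padding dim-0 up to a multiple of the world size gives every rank an equal chunk:

```python
# Toy illustration (not FSDP2 internals): pad dim-0 to a multiple of world_size
# so that every rank receives an equally sized shard.
import torch

world_size = 4
weight = torch.randn(10, 8)            # dim-0 (10) is not divisible by 4

padded_dim0 = -(-weight.shape[0] // world_size) * world_size   # ceil -> 12
pad = weight.new_zeros(padded_dim0 - weight.shape[0], weight.shape[1])
shards = torch.cat([weight, pad]).chunk(world_size, dim=0)

print([tuple(s.shape) for s in shards])  # [(3, 8), (3, 8), (3, 8), (3, 8)]
```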

Yes, I think the fp8 all-gather currently assumes that dim-0 is divisible by the world size.
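One way to see whether a model would hit this assumption is to audit parameter shapes up front; `find_bad_dim0_params` below is a hypothetical helper, not a PyTorch or torchao API:

```python
# Hypothetical helper: report parameters whose dim-0 is not divisible by the
# data-parallel world size (the assumption the fp8 all-gather relies on).
import torch.nn as nn


def find_bad_dim0_params(model: nn.Module, world_size: int):
    bad = []
    for name, param in model.named_parameters():
        if param.dim() > 0 and param.shape[0] % world_size != 0:
            bad.append((name, tuple(param.shape)))
    return bad


# Example: a vocab size of 50257 is not divisible by 8, so the embedding and
# output projection (weight and bias) are flagged.
model = nn.Sequential(nn.Embedding(50257, 512), nn.Linear(512, 50257))
print(find_bad_dim0_params(model, world_size=8))
```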

Could you provide some more details, like an example trace and a minimal repro of the code? (Otherwise, it is hard for us to understand and help with your issue.)

For FSDP, there are two ways to accumulate gradients:

1. Accumulate unsharded gradients (`model.no_sync` context in FSDP1, `model.set_requires_gradient_sync(is_last_microbatch)` in FSDP2; see the sketch below)
2. Accumulate sharded gradients

We should differentiate between training throughput...
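A minimal sketch of option 1, assuming `model` is wrapped with FSDP1's `FullyShardedDataParallel` or FSDP2's `fully_shard`, and with `microbatches`, `loss_fn`, and `optimizer` as placeholders:

```python
# Sketch of option 1: accumulate unsharded gradients across microbatches and
# only communicate (reduce) gradients on the last one.
import contextlib


def accumulate_unsharded(model, microbatches, loss_fn, optimizer, use_fsdp2: bool):
    for i, batch in enumerate(microbatches):
        is_last_microbatch = i == len(microbatches) - 1
        if use_fsdp2:
            # FSDP2: skip gradient reduction for all but the last microbatch.
            model.set_requires_gradient_sync(is_last_microbatch)
            ctx = contextlib.nullcontext()
        else:
            # FSDP1: no_sync() keeps gradients unsharded and skips communication;
            # run the final backward outside of it so gradients get reduced.
            ctx = contextlib.nullcontext() if is_last_microbatch else model.no_sync()
        with ctx:
            loss_fn(model(batch)).backward()
    optimizer.step()
    optimizer.zero_grad()
```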

I think in FSDP1, gradients are accumulated in the dtype in which they were computed (e.g. bf16). In FSDP2, if you specify `MixedPrecisionPolicy(reduce_dtype=torch.float32)`, then it will have extra logic...
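A minimal sketch of that FSDP2 configuration; the import path is an assumption (it has also lived under `torch.distributed._composable.fsdp` in some releases), and this assumes `torch.distributed` and a device mesh are already initialized (e.g. via `torchrun`):

```python
# Sketch: bf16 compute with fp32 gradient reduction in FSDP2.
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,    # all-gather/compute parameters in bf16
    reduce_dtype=torch.float32,    # upcast gradients to fp32 for reduce-scatter
)

model = nn.Transformer()
for layer in model.encoder.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)
```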