Andrew Gu

Results: 159 comments of Andrew Gu

Hi @shengfukevin! With your PR, I am seeing behavior like this when running our new FSDP implementation that uses a custom autograd function with non-tensor input: in my custom autograd...
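For context, a minimal sketch of what a custom autograd function with a non-tensor input can look like; the `ScaleByFactor` name and the scaling logic are hypothetical placeholders, not the actual FSDP function:

```python
import torch

class ScaleByFactor(torch.autograd.Function):
    # Hypothetical example of a custom autograd function that takes a
    # non-tensor input (the Python float `factor`) alongside a tensor.
    @staticmethod
    def forward(ctx, x, factor):
        ctx.factor = factor  # non-tensor inputs are stashed on ctx directly
        return x * factor

    @staticmethod
    def backward(ctx, grad_output):
        # Non-tensor inputs receive `None` as their "gradient".
        return grad_output * ctx.factor, None

x = torch.randn(4, requires_grad=True)
y = ScaleByFactor.apply(x, 2.0)
y.sum().backward()
```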

I further verified that this expensive CPU overhead happens on every profiled iteration, not just the first one. In other words, it is not a one-time cost.
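A minimal sketch of how one might confirm that an overhead recurs on every profiled iteration rather than only the first; the tiny `nn.Linear` model is a placeholder, not the actual repro:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Linear(64, 64)
x = torch.randn(32, 64)

# Label each iteration separately so per-iteration CPU time shows up as its
# own row in the profiler summary and can be compared across iterations.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for i in range(4):
        with record_function(f"iteration_{i}"):
            model(x).sum().backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```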

I deleted an outdated comment. Amazingly, we can use `torch.nn.utils.clip_grad_norm_()` without changing its code at all 😄 (Though, as mentioned in the other comment, I do not think this would...
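For reference, `torch.nn.utils.clip_grad_norm_()` takes an iterable of parameters and returns the total gradient norm before clipping; a minimal usage sketch with a toy model:

```python
import torch
import torch.nn as nn

# Toy model: compute gradients, then clip the total gradient norm to 1.0.
model = nn.Linear(8, 8)
model(torch.randn(2, 8)).sum().backward()

# Returns the total norm of the gradients before clipping is applied.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)
```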

`linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test (push)` failure unrelated:

```
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 9, 7) but found runtime version (8, 7, 0). PyTorch already comes...
```
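As a general sanity check (not specific to this CI job), one can print the cuDNN version PyTorch actually finds at runtime:

```python
import torch

# cuDNN version PyTorch sees at runtime (e.g. 8700 for 8.7.0)
print(torch.backends.cudnn.version())
# CUDA version PyTorch was built with
print(torch.version.cuda)
```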

I am curious to learn more about this for my understanding. Consider one weight `w`. IIUC, we are comparing (1) randomly initializing one single `w` and having each of the...

@BadrYoubiIdrissi This makes sense! I am curious to learn more about your use case of `NO_SHARD`. Is it mainly that it is easy to switch between sharding strategies? Or, is...

I am going to close this issue for now since this is related to FSDP1 `NO_SHARD`, not FSDP2.

I am assuming that this PR is not meant to land as-is but rather is for testing?