Andrew Gu

Results: 159 comments by Andrew Gu

re 2: it seems https://github.com/pytorch/torchtitan/issues/62 was closed too early? cc: @tianyu-l We can just get rid of the `strict=True` though -- I agree it is not really necessary.

IIUC, the default SDPA backend for us is flash, and the flash backward is non-deterministic? I think we can try enabling a deterministic SDPA backend: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
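
Something like the following is what I have in mind (a minimal sketch, assuming PyTorch >= 2.3 so that `torch.nn.attention.sdpa_kernel` is available; the tensor shapes are just for illustration):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes only: (batch, num_heads, seq_len, head_dim).
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Option 1: restrict SDPA to the math backend, whose backward is deterministic.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Option 2: ask PyTorch to prefer deterministic algorithms globally;
# with warn_only=True, non-deterministic kernels warn instead of erroring.
torch.use_deterministic_algorithms(True, warn_only=True)
```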

I agree with this. cc: @wconstab @H-Huang We need to discuss how we should do `clip_grad_norm_` with PP. Given our current design, we cannot solely rely on `nn.utils.clip_grad_norm_`. Each parameter...
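
For reference, a rough sketch of what I mean (the helper name and `pp_group` argument are hypothetical; with FSDP in the mix, the sharded gradients would also need a reduction over the DP group):

```python
import torch
import torch.distributed as dist

def clip_grad_norm_across_pp(parameters, max_norm: float, pp_group, eps: float = 1e-6):
    """Clip gradients by the global L2 norm across pipeline-parallel stages.

    Each PP rank only holds its own stage's parameters, so calling
    nn.utils.clip_grad_norm_ locally would clip against a per-stage norm.
    Instead, sum the squared local norms across the PP group first.
    """
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    # Local sum of squared gradient norms for this stage's parameters.
    local_sq = torch.zeros((), device=grads[0].device)
    for g in grads:
        local_sq += g.detach().float().norm(2) ** 2
    # Reduce across PP ranks to recover the true global squared norm.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=pp_group)
    total_norm = local_sq.sqrt()
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return total_norm
```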

If you apply `fully_shard` to each transformer block and then to the root module, this should work for tied embedding and final linear. The root module will manage both.
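
Roughly like this (a sketch assuming the blocks live in a `model.layers` `ModuleList` and that the output projection weight is tied to the token embedding; the `fully_shard` import path differs across PyTorch versions):

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard
# On older releases this lives at torch.distributed._composable.fsdp.fully_shard.

def apply_fsdp(model: nn.Module) -> nn.Module:
    # Each transformer block becomes its own FSDP parameter group.
    for block in model.layers:
        fully_shard(block)
    # The root call groups everything not already claimed by a block
    # (embedding, final norm, output projection), so the tied
    # embedding/output weight is managed by a single root FSDP unit.
    fully_shard(model)
    return model
```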

Do you want to train with 2D parallelism (FSDP + TP)? With TP only?

@yzhangcs sorry I am not as familiar with the checkpointing part. @fegin can you give some guidance here? Should the DCP implementation in torchtitan support parameter sharing?

`F.scaled_dot_product_attention` calls into flash or memory-efficient attention depending on some factors (should be mainly flash in the torchtitan case, IIUC). Are there other ops that you have in mind?
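
For example, the attention call is basically just the following (illustrative shapes; the dispatcher picks the backend based on dtype, head_dim, mask type, etc.):

```python
import torch
import torch.nn.functional as F

# SDPA expects (batch, num_heads, seq_len, head_dim).
q = torch.randn(4, 32, 2048, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# For a bf16 causal self-attention like this, the dispatcher typically picks
# the flash kernel; an explicit float mask or an unsupported head_dim can
# push it to the memory-efficient or math backends instead.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```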

@casper-hansen Makes sense! I guess it should not be too hard for users to install xformers and replace the `F.scaled_dot_product_attention` call with the xformers attention call. This should work as long...
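
Something along these lines (a sketch, not tested; assumes xformers is installed and the attention is causal; note that xformers uses a (batch, seq, heads, head_dim) layout, unlike SDPA's (batch, heads, seq, head_dim)):

```python
import xformers.ops as xops

def attention_xformers(q, k, v):
    # q, k, v arrive in SDPA layout: (batch, heads, seq, head_dim).
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = xops.memory_efficient_attention(
        q, k, v, attn_bias=xops.LowerTriangularMask()  # causal mask
    )
    return out.transpose(1, 2)  # back to (batch, heads, seq, head_dim)
```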