Andrew Gu

Results: 159 comments by Andrew Gu

re 2: it seems https://github.com/pytorch/torchtitan/issues/62 was closed too early? cc: @tianyu-l We can just get rid of the `strict=True` though -- I agree it is not really necessary.

IIUC, the default SDPA backend for us is flash, and the flash backward is non-deterministic? I think we can try enabling a deterministic SDPA backend: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
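
Something like the following is what I have in mind (a minimal sketch, assuming PyTorch >= 2.3 so that `torch.nn.attention.sdpa_kernel` is available; the tensor shapes are just for illustration):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes only: (batch, num_heads, seq_len, head_dim).
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Option 1: restrict SDPA to the math backend, whose backward is deterministic.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Option 2: ask PyTorch to prefer deterministic algorithms globally;
# with warn_only=True, non-deterministic kernels warn instead of erroring.
torch.use_deterministic_algorithms(True, warn_only=True)
```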

I agree with this. cc: @wconstab @H-Huang We need to discuss how we should do `clip_grad_norm_` with PP. Given our current design, we cannot solely rely on `nn.utils.clip_grad_norm_`. Each parameter...
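
For reference, a rough sketch of what I mean (the helper name and `pp_group` argument are hypothetical; with FSDP in the mix, the sharded gradients would also need a reduction over the DP group):

```python
import torch
import torch.distributed as dist

def clip_grad_norm_across_pp(parameters, max_norm: float, pp_group, eps: float = 1e-6):
    """Clip gradients by the global L2 norm across pipeline-parallel stages.

    Each PP rank only holds its own stage's parameters, so calling
    nn.utils.clip_grad_norm_ locally would clip against a per-stage norm.
    Instead, sum the squared local norms across the PP group first.
    """
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    # Local sum of squared gradient norms for this stage's parameters.
    local_sq = torch.zeros((), device=grads[0].device)
    for g in grads:
        local_sq += g.detach().float().norm(2) ** 2
    # Reduce across PP ranks to recover the true global squared norm.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=pp_group)
    total_norm = local_sq.sqrt()
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return total_norm
```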

If you apply `fully_shard` to each transformer block and then to the root module, this should work for tied embedding and final linear. The root module will manage both.
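
Roughly like this (a sketch assuming the blocks live in a `model.layers` `ModuleList` and that the output projection weight is tied to the token embedding; the `fully_shard` import path differs across PyTorch versions):

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard
# On older releases this lives at torch.distributed._composable.fsdp.fully_shard.

def apply_fsdp(model: nn.Module) -> nn.Module:
    # Each transformer block becomes its own FSDP parameter group.
    for block in model.layers:
        fully_shard(block)
    # The root call groups everything not already claimed by a block
    # (embedding, final norm, output projection), so the tied
    # embedding/output weight is managed by a single root FSDP unit.
    fully_shard(model)
    return model
```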

Do you want to train with 2D parallelism (FSDP + TP)? With TP only?

@yzhangcs sorry I am not as familiar with the checkpointing part. @fegin can you give some guidance here? Should the DCP implementation in torchtitan support parameter sharing?

`F.scaled_dot_product_attention` calls into flash or memory-efficient attention depending on some factors (should be mainly flash in the torchtitan case, IIUC). Are there other ops that you have in mind?
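
For example, the attention call is basically just the following (illustrative shapes; the dispatcher picks the backend based on dtype, head_dim, mask type, etc.):

```python
import torch
import torch.nn.functional as F

# SDPA expects (batch, num_heads, seq_len, head_dim).
q = torch.randn(4, 32, 2048, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# For a bf16 causal self-attention like this, the dispatcher typically picks
# the flash kernel; an explicit float mask or an unsupported head_dim can
# push it to the memory-efficient or math backends instead.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```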

@casper-hansen Makes sense! I guess it should not be too hard for users to install xformers and replace the `F.scaled_dot_product_attention` call with the xformers attention call. This should work as long...
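
Something along these lines (a sketch, not tested; assumes xformers is installed and the attention is causal; note that xformers uses a (batch, seq, heads, head_dim) layout, unlike SDPA's (batch, heads, seq, head_dim)):

```python
import xformers.ops as xops

def attention_xformers(q, k, v):
    # q, k, v arrive in SDPA layout: (batch, heads, seq, head_dim).
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = xops.memory_efficient_attention(
        q, k, v, attn_bias=xops.LowerTriangularMask()  # causal mask
    )
    return out.transpose(1, 2)  # back to (batch, heads, seq, head_dim)
```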