Andrew Gu


> But looking at the actual code, it's already [2 years old](https://github.com/pytorch/pytorch/blame/799acd31b4ca1c709f1e66dc58f22638e1f7696c/torch/distributed/_composable/__init__.py)!

Very sorry for the confusion! There are two separate functions called `fully_shard`, one being 2 years old and...
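For concreteness, a minimal sketch of the two import paths (both paths are assumptions based on the source layout at the time and may have moved or been removed in later releases):

```python
# The older, composable-API `fully_shard` (the ~2-year-old one linked above):
from torch.distributed._composable import fully_shard as fully_shard_old

# The newer FSDP2 `fully_shard`:
from torch.distributed._composable.fsdp import fully_shard as fully_shard_new
```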

This might be better filed in the torchtune repo?

It would help if you could provide a repro of some kind and/or share more information about which parallelism you are using.

Is it possible that you have any kind of conditional computation? For example, one data parallel rank does not receive the multimodal data, so the linear layer does not get...

> So it seems like the current implementation doesn't support such a function, right?

Yeah... you might need to feed some dummy data through, since this is breaking SPMD semantics; there...
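A minimal sketch of that workaround (the module and names here are hypothetical, not from the issue): every rank runs the otherwise-unused branch on zeros and scales the result by 0, so all ranks issue the same collectives and the unused parameters still receive (zero) gradients.

```python
from typing import Optional

import torch
import torch.nn as nn

class MultimodalBlock(nn.Module):
    """Hypothetical module where `vision_proj` only runs for image inputs."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.vision_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, text: torch.Tensor, image: Optional[torch.Tensor]) -> torch.Tensor:
        out = self.text_proj(text)
        if image is None:
            # Feed dummy data so every rank runs the same collectives;
            # scaling by 0 leaves the loss unchanged while still producing
            # (zero) gradients for `vision_proj`'s parameters.
            out = out + 0.0 * self.vision_proj(torch.zeros_like(text))
        else:
            out = out + self.vision_proj(image)
        return out
```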

I think there is a bit of nuance depending on how you apply FSDP to the model. If you are not directly calling `fully_shard` on that linear but rather some...
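As a sketch of that nuance (FSDP2-style; the import path is an assumption that depends on the PyTorch version, and this assumes a process group is already initialized, e.g. via `torchrun`): parameters are grouped per `fully_shard` call, so whether the linear gets its own all-gather depends on where the call is made.

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # path may vary

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

# Fine-grained: shard each linear individually, so each gets its own
# parameter group and its own all-gather.
for layer in model:
    fully_shard(layer)

# Root call: any remaining (not-yet-sharded) parameters are grouped together.
# If the loop above were skipped, both linears would land in this one group,
# i.e. the linear would be handled as part of its parent rather than directly.
fully_shard(model)
```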

@gau-nernst I will say that supporting row-wise scaling with FSDP2 pre-all-gather is painful, and I am not sure if we should ever do it (at least with FSDP2 -- maybe...

IMHO, you probably should not be using FP8 when you have such a small `out_features` size, but having a better check/error message for this might be good.
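One way to express such a check, as a sketch using torchao's `module_filter_fn` hook (the threshold of 16 is an arbitrary placeholder, not a tuned recommendation):

```python
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 8))

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    # Skip linears whose `out_features` is too small for FP8 to pay off.
    return isinstance(mod, nn.Linear) and mod.out_features >= 16

# Only the first linear (out_features=256) is swapped to a float8 linear.
convert_to_float8_training(model, module_filter_fn=module_filter_fn)
```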

Are there any example open source repos for training diffusion models in fp16? We can try to get some bandwidth to look at `DTensor` + grad scaler.

@nighting0le01 you will get better help if you can post an issue in https://github.com/pytorch/ao with more details.