Andrew Gu


> But looking at the actual code, it's already [2 years old](https://github.com/pytorch/pytorch/blame/799acd31b4ca1c709f1e66dc58f22638e1f7696c/torch/distributed/_composable/__init__.py)!

Very sorry for the confusion! There are two separate functions called `fully_shard`, one being 2 years old and...
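For concreteness, a minimal sketch of the two import paths (both paths are assumptions based on the source layout at the time and may have moved or been removed in later releases):

```python
# The older, composable-API `fully_shard` (the ~2-year-old one linked above):
from torch.distributed._composable import fully_shard as fully_shard_old

# The newer FSDP2 `fully_shard`:
from torch.distributed._composable.fsdp import fully_shard as fully_shard_new
```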

This might be better filed in the torchtune repo?

It would help if you could provide a repro of some kind and/or share more information about which parallelism you are using.

Is it possible that you have any kind of conditional computation? For example, one data parallel rank does not receive the multimodal data, so the linear layer does not get...

> So it seems like the current implementation doesn't support such a function, right?

Yeah... you might need to feed some dummy data through, since this is breaking SPMD semantics; there...
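A minimal sketch of that workaround (the module and names here are hypothetical, not from the issue): every rank runs the otherwise-unused branch on zeros and scales the result by 0, so all ranks issue the same collectives and the unused parameters still receive (zero) gradients.

```python
from typing import Optional

import torch
import torch.nn as nn

class MultimodalBlock(nn.Module):
    """Hypothetical module where `vision_proj` only runs for image inputs."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.vision_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, text: torch.Tensor, image: Optional[torch.Tensor]) -> torch.Tensor:
        out = self.text_proj(text)
        if image is None:
            # Feed dummy data so every rank runs the same collectives;
            # scaling by 0 leaves the loss unchanged while still producing
            # (zero) gradients for `vision_proj`'s parameters.
            out = out + 0.0 * self.vision_proj(torch.zeros_like(text))
        else:
            out = out + self.vision_proj(image)
        return out
```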

I think there is a bit of nuance depending on how you apply FSDP to the model. If you are not directly calling `fully_shard` on that linear but rather some...
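As a sketch of that nuance (FSDP2-style; the import path is an assumption that depends on the PyTorch version, and this assumes a process group is already initialized, e.g. via `torchrun`): parameters are grouped per `fully_shard` call, so whether the linear gets its own all-gather depends on where the call is made.

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # path may vary

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

# Fine-grained: shard each linear individually, so each gets its own
# parameter group and its own all-gather.
for layer in model:
    fully_shard(layer)

# Root call: any remaining (not-yet-sharded) parameters are grouped together.
# If the loop above were skipped, both linears would land in this one group,
# i.e. the linear would be handled as part of its parent rather than directly.
fully_shard(model)
```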

@gau-nernst I will say that supporting row-wise scaling with FSDP2 pre-all-gather is painful, and I am not sure if we should ever do it (at least with FSDP2 -- maybe...

IMHO, you probably should not be using FP8 when you have such a small `out_features` size, but having a better check/error message for this might be good.
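One way to express such a check, as a sketch using torchao's `module_filter_fn` hook (the threshold of 16 is an arbitrary placeholder, not a tuned recommendation):

```python
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 8))

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    # Skip linears whose `out_features` is too small for FP8 to pay off.
    return isinstance(mod, nn.Linear) and mod.out_features >= 16

# Only the first linear (out_features=256) is swapped to a float8 linear.
convert_to_float8_training(model, module_filter_fn=module_filter_fn)
```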

Are there any example open source repos for training diffusion models in fp16? We can try to get some bandwidth to look at `DTensor` + grad scaler.

@nighting0le01 you will get better help if you can post an issue in https://github.com/pytorch/ao with more details.