Andrew Gu
@pytorchbot merge
We had a land race. We will re-land this PR after a fix on the optimizer state side.
Hi @kartikayk! I was curious to learn more about the activation memory for the workload.
- What was the sequence length?
- Is this representative of other fine-tuning workloads, or...
cc: @tianyu-l is this issue done?
When a single GPU cannot fit even batch size 1, then depending on where the memory is coming from, some form of parallelism may be able to help (e.g. FSDP, TP,...
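For context, a minimal sketch of what wrapping a model with FSDP looks like (the toy model, dimensions, and script name here are made up for illustration, not taken from any real workload):

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer states are sharded
# across data-parallel ranks. Run with e.g. `torchrun --nproc_per_node=8 fsdp_sketch.py`.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy model standing in for a real transformer
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(model)  # each rank now holds only a shard of the parameters

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(1, 4096, device="cuda")  # batch size 1 per rank
model(x).sum().backward()
optim.step()
```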
@lucasjinreal We may need some more information. If there is an OOM in PyTorch, an explicit error should be raised, and the backtrace for that error can show whether the...
> So it can be concluded that if one cannot train a model with ZeRO-3 even with bs = 1, then one won't be able to do so with FSDP...
@lucasjinreal Can you take a look at https://pytorch.org/tutorials/intermediate/TP_tutorial.html? Note that fully sharded data parallelism (FSDP) is _not_ tensor parallelism, so it would not be expected for tensor parallelism to be...
I think you need to configure each layer to use TP, but you can probably make this into a loop:
https://github.com/pytorch/torchtitan/blob/0d09a3243368559fa0ce4dd84a20e084e740d2ee/torchtitan/parallelisms/parallelize_llama.py#L178-L209
This seems more like a fundamental property of tensor...
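A simplified sketch of that per-layer loop (the `Block` stand-in and its attribute names are illustrative, not the actual torchtitan Llama modules, and the real plan additionally handles norms, sequence parallelism, and input/output layouts):

```python
# Hedged sketch: one TP plan applied to every transformer block in a loop.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class Block(nn.Module):
    """Stand-in for one transformer block with Llama-style projection names."""
    def __init__(self, dim=4096, hidden=11008):
        super().__init__()
        self.attention = nn.ModuleDict(
            {name: nn.Linear(dim, dim, bias=False) for name in ("wq", "wk", "wv", "wo")}
        )
        self.feed_forward = nn.ModuleDict({
            "w1": nn.Linear(dim, hidden, bias=False),
            "w2": nn.Linear(hidden, dim, bias=False),
            "w3": nn.Linear(dim, hidden, bias=False),
        })

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
tp_mesh = init_device_mesh("cuda", (dist.get_world_size(),), mesh_dim_names=("tp",))

# Column-parallel projections produce sharded activations; the row-parallel
# output projections reduce them back to replicated activations.
layer_plan = {
    "attention.wq": ColwiseParallel(),
    "attention.wk": ColwiseParallel(),
    "attention.wv": ColwiseParallel(),
    "attention.wo": RowwiseParallel(),
    "feed_forward.w1": ColwiseParallel(),
    "feed_forward.w2": RowwiseParallel(),
    "feed_forward.w3": ColwiseParallel(),
}

# The same plan is applied to each layer in a loop instead of being written out per layer.
layers = nn.ModuleList([Block() for _ in range(4)])
for block in layers:
    parallelize_module(block, tp_mesh, layer_plan)
```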
What do you mean by default? torchtitan already provides the tensor parallel plan for the Llama model, so you can enable TP from the `.toml` file.
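As a rough sketch of what that looks like (the section and key names here, e.g. `tensor_parallel_degree`, are from memory and may differ between torchtitan versions, so please check the example config files in the repo):

```toml
# Hedged sketch of a torchtitan training config snippet; the key names are
# assumptions and may differ by version.
[training]
batch_size = 1
tensor_parallel_degree = 8    # size of each tensor-parallel group
data_parallel_degree = -1     # -1: infer from the remaining world size
```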