Andrew Gu
@pytorchbot merge
We had a land race. We will re-land this PR after a fix on the optimizer state side.
Hi @kartikayk! I was curious to learn more about the activation memory for the workload.
- What was the sequence length?
- Is this representative of other fine-tuning workloads, or...
cc: @tianyu-l is this issue done?
When a single GPU cannot fit even batch size 1, then depending on where the memory is coming from, some form of parallelism may be able to help (e.g. FSDP, TP,...
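For context, a minimal sketch of what wrapping a model with FSDP looks like (the toy model, dimensions, and script name here are made up for illustration, not taken from any real workload):

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer states are sharded
# across data-parallel ranks. Run with e.g. `torchrun --nproc_per_node=8 fsdp_sketch.py`.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy model standing in for a real transformer
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(model)  # each rank now holds only a shard of the parameters

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(1, 4096, device="cuda")  # batch size 1 per rank
model(x).sum().backward()
optim.step()
```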
@lucasjinreal We may need some more information. If there is an OOM in PyTorch, an explicit error should be raised, and the backtrace for that error can show whether the...
> So it can be concluded that if one cannot train a model with ZeRO-3 even with bs = 1, then one won't be able to do so with FSDP...
@lucasjinreal Can you take a look at https://pytorch.org/tutorials/intermediate/TP_tutorial.html? Note that fully sharded data parallelism (FSDP) is _not_ tensor parallelism, so it would not be expected for tensor parallelism to be...
I think you need to configure each layer to use TP, but you can probably make this into a loop:
https://github.com/pytorch/torchtitan/blob/0d09a3243368559fa0ce4dd84a20e084e740d2ee/torchtitan/parallelisms/parallelize_llama.py#L178-L209
This seems more like a fundamental property of tensor...
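A simplified sketch of that per-layer loop (the `Block` stand-in and its attribute names are illustrative, not the actual torchtitan Llama modules, and the real plan additionally handles norms, sequence parallelism, and input/output layouts):

```python
# Hedged sketch: one TP plan applied to every transformer block in a loop.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class Block(nn.Module):
    """Stand-in for one transformer block with Llama-style projection names."""
    def __init__(self, dim=4096, hidden=11008):
        super().__init__()
        self.attention = nn.ModuleDict(
            {name: nn.Linear(dim, dim, bias=False) for name in ("wq", "wk", "wv", "wo")}
        )
        self.feed_forward = nn.ModuleDict({
            "w1": nn.Linear(dim, hidden, bias=False),
            "w2": nn.Linear(hidden, dim, bias=False),
            "w3": nn.Linear(dim, hidden, bias=False),
        })

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
tp_mesh = init_device_mesh("cuda", (dist.get_world_size(),), mesh_dim_names=("tp",))

# Column-parallel projections produce sharded activations; the row-parallel
# output projections reduce them back to replicated activations.
layer_plan = {
    "attention.wq": ColwiseParallel(),
    "attention.wk": ColwiseParallel(),
    "attention.wv": ColwiseParallel(),
    "attention.wo": RowwiseParallel(),
    "feed_forward.w1": ColwiseParallel(),
    "feed_forward.w2": RowwiseParallel(),
    "feed_forward.w3": ColwiseParallel(),
}

# The same plan is applied to each layer in a loop instead of being written out per layer.
layers = nn.ModuleList([Block() for _ in range(4)])
for block in layers:
    parallelize_module(block, tp_mesh, layer_plan)
```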
What do you mean by default? torchtitan already provides the tensor parallel plan for the Llama model, so you can enable TP from the `.toml` file.
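As a rough sketch of what that looks like (the section and key names here, e.g. `tensor_parallel_degree`, are from memory and may differ between torchtitan versions, so please check the example config files in the repo):

```toml
# Hedged sketch of a torchtitan training config snippet; the key names are
# assumptions and may differ by version.
[training]
batch_size = 1
tensor_parallel_degree = 8    # size of each tensor-parallel group
data_parallel_degree = -1     # -1: infer from the remaining world size
```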