Is there a plan for supporting full fine-tuning 70B model?

Open dmammfl opened this issue 10 months ago • 5 comments

dmammfl avatar Apr 24 '24 23:04 dmammfl

Thanks for creating the issue! A full fine-tune of the 70B model, even in full bf16, takes roughly 140GB for params + 140GB for gradients out of the box. Once you add optimizer state and activations on top of that, even with SGD as the optimizer you're in the territory where something like 8x A100 80GB is needed, which is pretty hefty. Can you share more details on your use case? Are you looking to train on more than one node, or are there other memory-saving techniques you'd be interested in seeing for the full fine-tune of the 70B model?
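For concreteness, here is the back-of-the-envelope arithmetic as a small Python sketch. The numbers are illustrative only and deliberately ignore activations, optimizer state, and framework overhead:

```python
# Rough memory estimate for a full bf16 fine-tune of a 70B model.
# This is a sketch only: it counts params and gradients, nothing else.
NUM_PARAMS = 70e9            # ~70B parameters
BYTES_PER_PARAM_BF16 = 2     # bf16 = 2 bytes per element

params_gb = NUM_PARAMS * BYTES_PER_PARAM_BF16 / 1e9   # ~140 GB
grads_gb = params_gb                                  # same dtype/shape as params, ~140 GB

print(f"params: {params_gb:.0f} GB, grads: {grads_gb:.0f} GB, "
      f"total: {params_gb + grads_gb:.0f} GB")
# ~280 GB before activations/optimizer state, vs. 640 GB total on 8x A100 80GB.
```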

ebsmothers avatar Apr 24 '24 23:04 ebsmothers

Thanks for replying! I have 2 nodes with 4x 80GB A100 each, and I currently use "accelerate" for Llama 2 70B fine-tuning with multi-node FSDP. So I'm wondering whether there is a multi-node training guideline for torchrun, and whether there is a plan to support 70B fine-tuning.

dmammfl avatar Apr 25 '24 00:04 dmammfl

In previous work / experiments, I've been able to fit full finetunes of 70B models on 8x 80G A100s, and even 8x 40G A100s with CPU offloading (though this made training very slow).

If there's demand, I can see us adding a version of 70B full finetuning with paged/low-precision optimizers, FSDP wrapping, activation checkpointing, full bf16, and an option for CPU offloading. What do you think @ebsmothers?
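As a rough sketch of the CPU offloading and bf16 pieces using plain PyTorch FSDP APIs (this is not torchtune's actual recipe code, and the toy model here is just a stand-in for the real Llama model):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
)

# Assumes this runs under torchrun so rank/world-size env vars are set.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy stand-in for the 70B Llama model that torchtune would build.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

model = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),  # keep sharded params in CPU RAM
    mixed_precision=MixedPrecision(               # full bf16 compute/comms
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)
```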

rohan-varma avatar Apr 25 '24 08:04 rohan-varma

@rohan-varma just a +1 for this idea. Please help us GPU poor!! A 70B full finetune with paged/low-precision optimizers, FSDP wrapping, activation checkpointing, full bf16 + an option for CPU offloading would be a dream!!!

bratao avatar Apr 25 '24 18:04 bratao

Thanks @dmammfl and @bratao for the comments! To expand on what @rohan-varma mentioned, one feature gap here is enabling CPU offload, which (correct me if I'm wrong) can be done through the FSDP APIs. Other than that, I think most of these features (like the low-memory optimizers, activation checkpointing, etc.) can already be enabled through configs. The main challenge is that many bitsandbytes optimizers (definitely AdamW8bit; I'm not sure about others offhand) do not compose with FSDP, so you won't be able to save checkpoints. I think this is nontrivial to solve, but @rohan-varma knows much more here than I do so I'll defer to him.
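For reference, swapping in a low-memory optimizer is a one-line change on its own; the hard part is the FSDP checkpointing interaction described above. A minimal sketch, assuming a recent bitsandbytes and a placeholder learning rate:

```python
import bitsandbytes as bnb

# 8-bit (optionally paged) AdamW to shrink optimizer state.
# bnb.optim.AdamW8bit is the non-paged variant mentioned above.
# The lr here is just a placeholder, not a recommended value.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-5)

# The training step itself is unchanged:
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```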

Also, regarding running on multiple nodes: we don't really test for this today (I only have a single-node environment). You can certainly try tune run --nnodes 2 --nproc_per_node 4 ..., but no promises it'll work out of the box. If you run into rough edges, let us know!

ebsmothers avatar Apr 26 '24 02:04 ebsmothers

hey @dmammfl @bratao @rohan-varma, SFT of the 70B model is now supported by torchtune. Please let us know if you have any issues/questions with it! :)

https://github.com/pytorch/torchtune/blob/6f37d15b2c99d49ca926173455569aa6f8e24d9d/recipes/configs/llama3/70B_full.yaml#L9

felipemello1 avatar Jun 28 '24 15:06 felipemello1