
GPU Middle Class?

Open EugenHotaj opened this issue 11 months ago • 7 comments

Does torchtune have any plans to support "GPU middle class" users?

We're trying to evaluate torchtune for post-training, especially since there are many useful features already implemented (RLHF, LoRA, etc.). However, one big sticking point is that the system seems heavily geared towards single-node training. Are there plans to support multi-node training (e.g. 16-64 nodes) and things like model parallelism, 128k context training, etc.?

If not, is torchtitan the recommended system to use?

Thanks!

EugenHotaj avatar Dec 16 '24 17:12 EugenHotaj

Hey @EugenHotaj - glad you're checking out torchtune. Up until now, we've managed to provide pretty extensive offerings including long-context, large models up to 405B, and RLHF all on a single node. This has allowed people with smaller GPU budgets to fine-tune some pretty incredible models and develop new features faster b/c single node is much easier to debug.

Now, all that said, torchtune technically already supports multi-node for FSDP. And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!
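
For context, a minimal sketch of launching the existing distributed (FSDP) recipe on a single node via the tune CLI; the GPU count and the stock Llama 3.3 70B config here are only illustrative choices:

# Single-node launch of torchtune's distributed recipe (8 GPUs assumed)
tune run --nproc_per_node 8 full_finetune_distributed --config llama3_3/70B_full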

Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

joecummings avatar Dec 16 '24 18:12 joecummings

> And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!

Thanks @joecummings that's awesome to hear!

> Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

Yes we use SLURM -- I'm currently trying to hack a multi-node run from your suggestions on #2018 and torchtitan, so having some examples in torchtune would be super useful imo. We'd also take all the parallelisms we can get 😃, e.g. model, pipeline, and attention parallelism for longer context.

EugenHotaj avatar Dec 16 '24 18:12 EugenHotaj

I second SLURM! I have also been trying to hack this into torchtune since the single-node experience is quite good.

tginart avatar Dec 17 '24 23:12 tginart

Thanks folks for the interest! Us torchtune devs are evidently not in the GPU middle class yet 😅 and I think only @joecummings has access to a multi-node setup as of today. I know he is working on testing this out, but until then @EugenHotaj we would love to include any SLURM scripts you're able to put together as part of our documentation.

ebsmothers avatar Dec 19 '24 01:12 ebsmothers

@ebsmothers the torchtitan SLURM file worked pretty much out of the box for us since we have a similar cluster setup (P5s on AWS). I was able to run Llama 3.3 70B full finetuning on 16 nodes with no issues 😄 .

EugenHotaj avatar Dec 19 '24 02:12 EugenHotaj

@EugenHotaj Thanks for the tip.

Did you use something like https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py as the entry point to replace "./train.py" on line 63?

tginart avatar Dec 19 '24 08:12 tginart

@tginart right, you have to replace that torchrun line with something like:

srun torchrun --nnodes 4 --nproc_per_node 8 --rdzv_id $SLURM_JOB_ID --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" recipes/full_finetune_distributed.py --config recipes/configs/llama3_3/70B_full.yaml
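
For anyone adapting this, a minimal sbatch wrapper around that command might look roughly like the sketch below. The partition name, node/GPU counts, and environment activation are assumptions about your own cluster, and the head-node IP lookup follows the same pattern as the torchtitan script:

#!/bin/bash
#SBATCH --job-name=torchtune-multinode
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --partition=debug          # assumption: replace with your partition

# Resolve the rendezvous endpoint from the first allocated node.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Activate whatever environment has torchtune installed (assumption).
# source /path/to/venv/bin/activate

# One torchrun launcher per node, 8 workers each.
srun torchrun --nnodes 4 --nproc_per_node 8 \
  --rdzv_id "$SLURM_JOB_ID" --rdzv_backend c10d \
  --rdzv_endpoint "$head_node_ip:29500" \
  recipes/full_finetune_distributed.py \
  --config recipes/configs/llama3_3/70B_full.yaml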

EugenHotaj avatar Dec 19 '24 14:12 EugenHotaj

@EugenHotaj @joecummings Count me among the interested parties. I've only got 4 H100s per node, so I would really like to use more than one :)

csiefer2 avatar Jan 13 '25 18:01 csiefer2

Definitely… if this becomes supported I'd love to beta test an official multi-node recipe.


tginart avatar Jan 13 '25 18:01 tginart

I forgot to update here but I can confirm @EugenHotaj 's approach of using the torchtitan slurm file (with a few tweaks that are probably specific to your own slurm env) works with torchtune.

I think just throwing a sample version of this torchtitan slurm file with some instructions into this repo should basically be enough to complete this work item.

tginart avatar Jan 23 '25 02:01 tginart

Hey - super glad to hear it works @tginart! I figured the actual changes to the multi-node script would be minimal, but from our side we want to test as many of the configs and options as possible to make sure we don't hit any weird errors. I swear I'm in the middle of doing this, but I had to set up my own SLURM cluster (since Meta uses something else), so I'm just waiting for all those jobs to finish running. Then I'll publish some documentation and declare multi-node officially open for business in torchtune :)

slurm@computeinstance-e00mw5pqj8y0730220:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 4     debug torchtun    slurm  R       6:24      2 slurm-worker-[1-2]

joecummings avatar Jan 23 '25 15:01 joecummings

#2301

joecummings avatar Feb 10 '25 18:02 joecummings