
Is a PP+FSDP+TP config (TOML) available for pre-training the 405B model?

Open githubsgi opened this issue 8 months ago • 3 comments

I would appreciate it if someone could share a TOML file for running PP+FSDP+TP on the 405B model.

githubsgi avatar Mar 19 '25 21:03 githubsgi

Hi @githubsgi - we have this one here: https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/train_configs/llama3_405b.toml. Some of this will depend on how many GPUs you have and what type they are, since memory will be a constraint.

lessw2020 avatar Mar 20 '25 01:03 lessw2020

Thanks, I am familiar with that one, where PP is set to 1. All my attempts at setting PP > 1 failed. Does the automatic slicing of layers work with the 405B model?

githubsgi avatar Mar 24 '25 17:03 githubsgi

It should work. E.g., see Table 5 in https://github.com/pytorch/torchtitan/blob/main/docs/performance.md. Could you provide a detailed bug report so we can help?

tianyu-l avatar Mar 31 '25 05:03 tianyu-l
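
For anyone sizing a PP+FSDP+TP run before filing a bug report, a quick sanity check can rule out a mismatched degree product or an impossible layer split. The sketch below is not torchtitan code: `ParallelPlan`, `split_layers`, and `validate` are hypothetical helpers, the even split is only an illustration and not necessarily how torchtitan slices layers automatically, and the 126-layer count is an assumption about the Llama 3.1 405B architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPlan:
    """Hypothetical container for parallelism degrees (not a torchtitan class)."""
    pp: int        # pipeline-parallel degree
    dp_shard: int  # FSDP sharding degree
    tp: int        # tensor-parallel degree

    def world_size(self) -> int:
        # With only PP, FSDP, and TP enabled, the degrees must multiply
        # to the total number of ranks.
        return self.pp * self.dp_shard * self.tp


def split_layers(n_layers: int, pp: int) -> list[int]:
    """Spread n_layers over pp stages as evenly as possible (illustrative only)."""
    if pp < 1 or pp > n_layers:
        raise ValueError(f"pp={pp} must be between 1 and n_layers={n_layers}")
    base, extra = divmod(n_layers, pp)
    # The first `extra` stages take one additional layer each.
    return [base + (1 if i < extra else 0) for i in range(pp)]


def validate(plan: ParallelPlan, n_gpus: int) -> None:
    """Raise if the degree product does not match the number of GPUs."""
    if plan.world_size() != n_gpus:
        raise ValueError(
            f"pp*dp_shard*tp = {plan.world_size()} does not match {n_gpus} GPUs"
        )


if __name__ == "__main__":
    N_LAYERS_405B = 126  # assumption: transformer layer count of Llama 3.1 405B
    plan = ParallelPlan(pp=8, dp_shard=8, tp=8)
    validate(plan, n_gpus=512)
    print(split_layers(N_LAYERS_405B, plan.pp))
```

With 126 layers, a PP degree of 8 cannot divide the model evenly, so some stages hold one more layer than others. That imbalance, together with the embedding and output layers on the first and last stages, is worth stating in a bug report alongside the GPU count and the exact degrees used.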