
Can we train llama-13B model with model parallelism?

Open mengjiexu opened this issue 1 year ago • 2 comments

Currently we can train the llama-7B model on a single RTX 3090. Can we train the llama-13B model on two RTX 3090s with model parallelism?

mengjiexu avatar May 01 '23 08:05 mengjiexu

Feel free to try it by changing the deepspeed config.

shizhediao avatar May 01 '23 09:05 shizhediao

Thanks for your interest in LMFlow! I think configs/ds_config_zero3.json provides model parallelism (it also uses CPU offload for optimizer states and model parameters) and can be used for your case. Hope that helps. Thanks 😄
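For concreteness, below is a minimal sketch of the kind of ZeRO-3 settings this refers to, written as a Python dict and serialized to JSON. It only illustrates stage-3 partitioning plus CPU offload; the actual contents of configs/ds_config_zero3.json may differ, and the output filename here is made up.

```python
import json

# Minimal ZeRO-3 config with CPU offload for optimizer states and parameters.
# configs/ds_config_zero3.json in the repo plays the same role; its exact
# contents may differ from this illustration.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: partition optimizer states, gradients, and parameters across GPUs
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

# Write it out so it can be passed to the launcher / training script the same
# way configs/ds_config_zero3.json is (hypothetical filename).
with open("my_ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting file can then be supplied wherever the repo's own config is today, i.e. via the same flag that currently points at configs/ds_config_zero3.json.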

research4pan avatar May 01 '23 11:05 research4pan

I found that configs/ds_config_zero3.json works well for model parallelism on a single node with 8 A100s. However, when I attempted to fine-tune all parameters of llama-65B on two nodes (each with 8 A100s), it looked as though each node held a full copy of the model. Is there any configuration that would let us instantiate only one model, partitioned across the two nodes with model parallelism? Thank you~

csyourui avatar May 23 '23 01:05 csyourui

Thanks for your interest in LMFlow! In our experience, the default behavior of DeepSpeed ZeRO-3 is most likely to partition the model across nodes. We tested llama-33B on our servers: it cannot be trained on a single node, but it can be trained on multiple nodes. Since configs/ds_config_zero3.json also offloads states to RAM dynamically, GPU memory usage may not be an exact indicator of how much of the model lives on each GPU. However, that is just our experience and conjecture; to confirm this behavior with certainty, you may want to raise an issue in the DeepSpeed repository. Thanks very much 😄
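If it helps, one rough way to check from inside the training process whether parameters are actually sharded (rather than reading GPU memory, which offloading makes misleading) is to compare each parameter's local size with its logical size. The sketch below leans on ZeRO-3 internals such as `ds_numel` that DeepSpeed attaches to partitioned parameters; these are implementation details rather than a stable API, and the `engine` variable name is an assumption about your training script.

```python
import torch
import torch.distributed as dist

def report_zero3_sharding(model):
    """Print per-rank evidence of ZeRO-3 parameter partitioning.

    Relies on DeepSpeed internals: under ZeRO-3 each parameter tensor is
    replaced by a tiny placeholder and gains a `ds_numel` attribute holding
    its full logical size. These attributes are not a stable public API.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    # Elements this rank actually materializes right now (tiny placeholders under ZeRO-3).
    local_elems = sum(p.numel() for p in model.parameters())
    # Full, unpartitioned element count (falls back to numel() if ZeRO-3 is not active).
    logical_elems = sum(getattr(p, "ds_numel", p.numel()) for p in model.parameters())
    gpu_mem_gib = torch.cuda.memory_allocated() / 2**30 if torch.cuda.is_available() else 0.0
    print(f"[rank {rank}] local elements: {local_elems:,} | "
          f"logical elements: {logical_elems:,} | "
          f"GPU memory allocated: {gpu_mem_gib:.1f} GiB")
    # If the model is truly partitioned (and/or offloaded), local_elems should be
    # far smaller than logical_elems on every rank; a full replica per node would
    # show local_elems close to logical_elems instead.

# Usage inside the training script, after DeepSpeed has built its engine
# (`engine` is a placeholder name for the returned DeepSpeedEngine):
# report_zero3_sharding(engine.module)
```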

research4pan avatar May 25 '23 08:05 research4pan