[Feature] Tensor parallelism fine-tuning
Motivation
DeepSpeed provides tensor parallelism out of the box. However, when I modify the config, for example adding "model_parallel" parameters to internvl_chat/zero_stage3_config.json to fine-tune the 26B model:
"model_parallel": { "enabled": true, "dp_world_size": 6, "tensor_parallel_size": 6, "pipeline_parallel_size": 1, "cpu_offload": true },
the GPU utilization looks like this:
[screenshot of GPU utilization]
Based on your documentation, serving a single 26B model on 4 GPUs requires about 30 GB of memory per GPU, and about 25,806 MB (~25.8 GB) per GPU on 8 GPUs. Interpolating, I would expect roughly 28 GB of memory per GPU on 6 GPUs, but instead I get an out-of-memory error.
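The interpolation I have in mind, as a quick check:

```python
# Quick check: linearly interpolate the documented per-GPU memory
# between the 4-GPU and 8-GPU data points.
mem_4_gpus = 30.0    # GB per GPU on 4 GPUs (from the docs)
mem_8_gpus = 25.806  # GB per GPU on 8 GPUs (25806 MB in the docs)
n = 6
expected = mem_4_gpus + (n - 4) / (8 - 4) * (mem_8_gpus - mem_4_gpus)
print(f"expected per-GPU memory on {n} GPUs: {expected:.1f} GB")  # -> ~27.9 GB
```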
So, even though the docs suggest I can fit the 26B model and fine-tune it with a batch size of 1 and gradient accumulation of, for example, 8 (spelled out below), I run into this problem.
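To be concrete, by batch size 1 with accumulation 8 I mean the standard DeepSpeed batch keys (`train_micro_batch_size_per_gpu` and `gradient_accumulation_steps` are documented DeepSpeed config options; the values below are just the ones I tried):

```python
# Sketch of the batch-related part of my DeepSpeed config.
ds_batch_config = {
    "train_micro_batch_size_per_gpu": 1,  # batch size 1 on each GPU
    "gradient_accumulation_steps": 8,     # effective accumulated batch of 8
}
```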
I am not sure that I did everything correctly. If my addition to the config is wrong, please let me know; if not, what do you think about adding tensor-level parallelism? Thank you in advance.
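For what it's worth, my understanding (which may be wrong) is that DeepSpeed's training engine does not read tensor parallelism from a JSON section at all; it learns about tensor-parallel process groups through the `mpu` argument of `deepspeed.initialize`, as in Megatron-DeepSpeed. A minimal sketch under that assumption; the `megatron` import, the `initialize_model_parallel` call, and `build_model` are placeholders for whatever the real integration would look like:

```python
import deepspeed

# Assumption: a Megatron-style model-parallel unit (mpu) that owns the
# tensor-parallel process groups; the exact import and call vary by version.
from megatron import mpu

mpu.initialize_model_parallel(6)  # hypothetical: 6-way tensor parallelism

model = build_model()  # placeholder for constructing the InternVL 26B model

# deepspeed.initialize accepts an `mpu` argument; passing it is how ZeRO
# is told which ranks share tensor-parallel shards of the same weights.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    mpu=mpu,
    config="internvl_chat/zero_stage3_config.json",
)
```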
Related resources
No response
Additional context
No response