[Feature] Tensor parallelism fine-tuning
Motivation
DeepSpeed provides tensor parallelism out of the box. However, when I modify the config, for example adding "model_parallel" parameters to internvl_chat/zero_stage3_config.json to fine-tune the 26B model:
"model_parallel": { "enabled": true, "dp_world_size": 6, "tensor_parallel_size": 6, "pipeline_parallel_size": 1, "cpu_offload": true },
the GPU utilization looks like this:
[screenshot of GPU utilization]
Based on your documentation, serving a single 26B model on 4 GPUs requires about 30 GB of memory per GPU, and about 25,806 MB (~25.8 GB) per GPU on 8 GPUs. Interpolating, I would expect roughly 28 GB of memory per GPU on 6 GPUs, but instead I get an out-of-memory error.
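The interpolation I have in mind, as a quick check:

```python
# Quick check: linearly interpolate the documented per-GPU memory
# between the 4-GPU and 8-GPU data points.
mem_4_gpus = 30.0    # GB per GPU on 4 GPUs (from the docs)
mem_8_gpus = 25.806  # GB per GPU on 8 GPUs (25806 MB in the docs)
n = 6
expected = mem_4_gpus + (n - 4) / (8 - 4) * (mem_8_gpus - mem_4_gpus)
print(f"expected per-GPU memory on {n} GPUs: {expected:.1f} GB")  # -> ~27.9 GB
```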
So, even though the docs suggest I can fit the 26B model and fine-tune it with a batch size of 1 and gradient accumulation of, for example, 8 (spelled out below), I run into this problem.
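To be concrete, by batch size 1 with accumulation 8 I mean the standard DeepSpeed batch keys (`train_micro_batch_size_per_gpu` and `gradient_accumulation_steps` are documented DeepSpeed config options; the values below are just the ones I tried):

```python
# Sketch of the batch-related part of my DeepSpeed config.
ds_batch_config = {
    "train_micro_batch_size_per_gpu": 1,  # batch size 1 on each GPU
    "gradient_accumulation_steps": 8,     # effective accumulated batch of 8
}
```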
I am not sure that I did everything correctly. If my addition to the config is wrong, please let me know; if not, what do you think about adding tensor-level parallelism? Thank you in advance.
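For what it's worth, my understanding (which may be wrong) is that DeepSpeed's training engine does not read tensor parallelism from a JSON section at all; it learns about tensor-parallel process groups through the `mpu` argument of `deepspeed.initialize`, as in Megatron-DeepSpeed. A minimal sketch under that assumption; the `megatron` import, the `initialize_model_parallel` call, and `build_model` are placeholders for whatever the real integration would look like:

```python
import deepspeed

# Assumption: a Megatron-style model-parallel unit (mpu) that owns the
# tensor-parallel process groups; the exact import and call vary by version.
from megatron import mpu

mpu.initialize_model_parallel(6)  # hypothetical: 6-way tensor parallelism

model = build_model()  # placeholder for constructing the InternVL 26B model

# deepspeed.initialize accepts an `mpu` argument; passing it is how ZeRO
# is told which ranks share tensor-parallel shards of the same weights.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    mpu=mpu,
    config="internvl_chat/zero_stage3_config.json",
)
```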
Related resources
No response
Additional context
No response