CogVLM2
[question] Zero-3 fine-tuning
Feature request
Firstly, thanks for your work; the model is quite effective at understanding images.
Could you explain what in the model architecture prevents it from being trained with its weights sharded across multiple GPUs using DeepSpeed ZeRO-3? (I would try to get it working myself, but since I'm not very familiar with distributed training, I'd prefer a brief explanation of whether this is possible first.) Since 8x A100 is quite a steep requirement, it would be useful if people could fine-tune the model on a smaller GPU VRAM budget.
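For concreteness, the setup being asked about would look roughly like the sketch below. This is not a configuration tested with CogVLM2; the keys follow DeepSpeed's documented ZeRO config schema, and the batch-size and offload values are illustrative assumptions:

```python
# Minimal DeepSpeed ZeRO stage-3 configuration, expressed as a Python dict.
# Stage 3 shards optimizer states, gradients, AND model parameters across
# ranks, which is what would let fine-tuning fit in a smaller VRAM budget.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,     # illustrative value
    "gradient_accumulation_steps": 8,        # illustrative value
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Optional CPU offload trades speed for further VRAM savings.
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# A dict like this would be passed as the `config` argument to
# deepspeed.initialize(model=..., config=ds_config) in the training script.
```

Whether this works presumably hinges on whether anything in the CogVLM2 architecture (e.g. how the vision expert modules are wired into the language model) interferes with ZeRO-3's parameter partitioning.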
Motivation
8x A100 may be out of reach for many researchers.
Your contribution
None.