CogVLM2
[question] Zero-3 fine-tuning
Feature request
Firstly, thanks for your work; the model is quite effective at understanding images.
Could you explain what in the model architecture prevents it from being trained with its weights sharded across multiple GPUs using DeepSpeed ZeRO-3? (I would try to get it working myself, but since I'm not very familiar with distributed training, I'd prefer a brief explanation of whether this is possible first.) Since 8x A100 is quite a steep requirement, it would be useful if people could fine-tune the model on a smaller GPU VRAM budget.
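For concreteness, the setup being asked about would look roughly like the sketch below. This is not a configuration tested with CogVLM2; the keys follow DeepSpeed's documented ZeRO config schema, and the batch-size and offload values are illustrative assumptions:

```python
# Minimal DeepSpeed ZeRO stage-3 configuration, expressed as a Python dict.
# Stage 3 shards optimizer states, gradients, AND model parameters across
# ranks, which is what would let fine-tuning fit in a smaller VRAM budget.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,     # illustrative value
    "gradient_accumulation_steps": 8,        # illustrative value
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Optional CPU offload trades speed for further VRAM savings.
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# A dict like this would be passed as the `config` argument to
# deepspeed.initialize(model=..., config=ds_config) in the training script.
```

Whether this works presumably hinges on whether anything in the CogVLM2 architecture (e.g. how the vision expert modules are wired into the language model) interferes with ZeRO-3's parameter partitioning.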
Motivation
8x A100 may be out of reach for many researchers.
Your contribution
None.