
Finetuning Configuration

Open WAILMAGHRANE opened this issue 1 year ago • 10 comments

Hi, could you let me know if anyone has successfully fine-tuned the model? Additionally, I have a question about GPU requirements: is 31.2 GB needed per GPU, or is it split between two GPUs? Also, I noticed that Kaggle offers 2 T4 GPUs; are these sufficient for fine-tuning the model with a custom dataset? Thanks!

[Screenshot 2024-06-01 013114]

WAILMAGHRANE avatar Jun 01 '24 14:06 WAILMAGHRANE

31.2GB per GPU was tested with two A100 GPUs; you can use ZeRO-3 + offload to minimize memory usage. According to the DeepSpeed ZeRO strategy, the more GPUs you have, the lower the memory usage on each GPU. The final memory usage is also related to the max input length and the image resolution.

If you have two T4 GPUs, you can try it by setting use_lora=true, tune_vision=false, batch_size=1, a suitable model_max_length, and a ZeRO-3 config.
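Roughly, those settings would look something like this when passed to the fine-tuning script (a sketch only: the launcher line, model path, and data path are placeholders here; check the flag names against your local finetune_lora.sh):

```bash
# Rough sketch only: launcher and paths are placeholders,
# verify flag names against your local finetune_lora.sh.
torchrun --nproc_per_node=2 finetune.py \
    --use_lora true \
    --tune_vision false \
    --per_device_train_batch_size 1 \
    --model_max_length 512 \
    --fp16 true \
    --deepspeed ds_config_zero3.json
```

On T4 cards you would also want --fp16 true rather than bf16, since T4 GPUs do not support bf16.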

YuzaChongyi avatar Jun 02 '24 07:06 YuzaChongyi

Hi @YuzaChongyi, can we fine-tune this model with one A100 (40G)?

whyiug avatar Jun 03 '24 10:06 whyiug

@whyiug If you have only one GPU, you can't reduce memory with ZeRO sharding, but you can still reduce GPU memory with ZeRO-Offload; this is the minimum-memory configuration, so you can try it.

YuzaChongyi avatar Jun 03 '24 11:06 YuzaChongyi

Yeah, I only have an A100 card (40G). I ran finetune_lora.sh with these settings:

--model_max_length 1024
--per_device_train_batch_size 1
--deepspeed ds_config_zero3.json

It reports an error:

RuntimeError: "erfinv_cuda" not implemented for 'BFloat16'

Maybe it comes from this line: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/20aecf8831d1d7a3da19bd62f44d1aea82df7fee/resampler.py#L85

Please tell me how to fix it quickly by changing the code or configuration. Thanks for your quick reply :) @YuzaChongyi

whyiug avatar Jun 03 '24 12:06 whyiug

I haven't encountered this error yet; it may be caused by a particular PyTorch version or something similar. If the error occurs during the resampler initialization step, you can comment out that line, because the checkpoint will be loaded afterwards and will reset the model state_dict anyway. Or you can use --fp16 true instead of bf16.
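If you would rather patch it than switch precision, the idea is to run that initialization in float32 and copy the result back. A minimal sketch, assuming the line in question is a trunc_normal_-style init (which calls erfinv internally); the actual code in resampler.py may differ:

```python
# Hedged sketch, not the upstream resampler.py code: trunc_normal_ relies on
# erfinv, which has no BFloat16 CUDA kernel in some PyTorch versions, so run
# the init in float32 and copy the result back into the original parameter.
import torch
from torch.nn.init import trunc_normal_

def trunc_normal_bf16_safe(param: torch.nn.Parameter, std: float = 0.02) -> None:
    with torch.no_grad():
        tmp = torch.empty(param.shape, dtype=torch.float32, device=param.device)
        trunc_normal_(tmp, std=std)        # erfinv runs in float32 here
        param.copy_(tmp.to(param.dtype))   # cast back to the parameter's dtype
```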

YuzaChongyi avatar Jun 04 '24 07:06 YuzaChongyi

Please set:

--bf16 false \
--bf16_full_eval false \
--fp16 true \
--fp16_full_eval true

This is because zero3 is not compatible with bf16; please use fp16.

qyc-98 avatar Jun 04 '24 13:06 qyc-98

If you only have one A100, change ds_config_zero3.json as follows to offload the parameters and optimizer to CPU and save memory:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    }
}

qyc-98 avatar Jun 04 '24 14:06 qyc-98

Yeah, I already did that.

whyiug avatar Jun 04 '24 14:06 whyiug

I changed this, but still got the following error:

File "/home/paperspace/miniconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

zhu-j-faceonlive avatar Jun 05 '24 13:06 zhu-j-faceonlive

I made the same change on two 4090 cards and got the same error. Has this been resolved? Thanks!

shituo123456 avatar Jun 06 '24 00:06 shituo123456

This problem usually shows up with ZeRO-3. Try building DeepSpeed from source with the CPU Adam op:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_CPU_ADAM=1 pip install .

LDLINGLINGLING avatar Jul 04 '24 09:07 LDLINGLINGLING