
Running with DDP runs out of GPU memory, but finetuning works with model parallelism, although it is very slow

AlexJJJChen opened this issue 2 months ago · 6 comments

nproc_per_node=4

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model_id_or_path "AI-ModelScope/llava-v1.6-mistral-7b" \
    --template_type "llava-mistral-instruct" \
    --custom_train_dataset_path train_swift.json \
    --custom_val_dataset_path test_swift.json \
    --dataset_test_ratio "0.15" \
    --save_steps "20" \
    --lora_target_modules q_proj k_proj v_proj \
    --batch_size "8" \
    --learning_rate "1e-4" \
    --num_train_epochs "2" \
    --gradient_accumulation_steps "16" \
    --eval_batch_size "8" \
    --use_flash_attn "True" \
    --add_output_dir_suffix False \
    --output_dir finetune_output_epoch_100 \
    --logging_dir finetune_output_epoch_100 \
    --max_length -1 \
    --train_dataset_sample -1 \
    --sft_type lora \
    --tuner_backend peft \
    --quantization_bit 4 \
    --bnb_4bit_comp_dtype AUTO \
    --ddp_backend nccl \
    --check_dataset_strategy warning \
    --gradient_checkpointing "True" \
    --deepspeed zero3-offload
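For context on why DDP can run out of memory here while model parallelism does not: DDP keeps a full replica of the model on every GPU (plus its gradients and optimizer state), whereas model parallelism splits the layers across the visible GPUs, so each card holds only a fraction of the weights at the cost of the GPUs running largely one after another, which is why it is much slower. A minimal sketch of the two placements using the Hugging Face transformers API; the model id is only an illustration, not the exact class or checkpoint ms-swift loads internally:

import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"  # illustration only, not the llava checkpoint

# DDP-style placement: every rank would do this, so each GPU holds the whole model.
# ddp_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Model-parallel placement: layers are sharded across GPUs 0-3, so each GPU holds
# roughly a quarter of the weights, but only one GPU computes at a time.
mp_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)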

A solution found online:

In the original code, the loaded checkpoint is placed directly on the GPU:

pretrain_weight = torch.load(path)['model']

It should be changed to load onto the CPU first:

pretrain_weight = torch.load(path, map_location=torch.device('cpu'))['model']
model.load_state_dict(pretrain_weight)
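Spelled out, the pattern is: deserialize the checkpoint onto the CPU, copy the weights into the model, and only then move the model to its target GPU, so the checkpoint is never materialized on a GPU that is already full. A minimal, self-contained sketch of that pattern; the path, model class, and 'model' key are placeholders, not the actual ms-swift code:

import torch
import torch.nn as nn

path = "pretrain.pth"        # placeholder checkpoint path
model = nn.Linear(10, 10)    # placeholder; stands in for the real model

# Deserialize on the CPU so the checkpoint never occupies GPU memory.
state = torch.load(path, map_location=torch.device('cpu'))
pretrain_weight = state['model']

# Copy the weights into the (CPU) model, then move the model to its device.
model.load_state_dict(pretrain_weight)
model.to('cuda:0')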

AlexJJJChen · Apr 30 '24 08:04