ChatGLM-Efficient-Tuning: multi-GPU LoRA training error

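Training on 24 GB RTX 3090 GPUs.

Single-GPU LoRA training uses about 13 GB of GPU memory.

Multi-GPU training fails with: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

Monitoring the GPUs shows memory usage climbing all the way to 24 GB before the error is raised. Is this a case of running out of GPU memory?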
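
A minimal sketch for checking this per card, assuming `nvidia-smi` is on the PATH of the training machine. It reads used and total memory through the `--query-gpu` CSV interface, which makes it easy to see whether one GPU fills up while the other stays mostly idle. The helper names `parse_gpu_memory` and `read_gpu_memory` are hypothetical and not part of the repo. `CUBLAS_STATUS_NOT_INITIALIZED` is commonly reported when a device is nearly out of memory at the moment the cuBLAS handle is created, though other causes are possible.

```python
import shutil
import subprocess

# Real nvidia-smi query flags; values are reported in MiB because of "nounits".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV lines (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append(
            {
                "index": index,
                "used_mib": used,
                "total_mib": total,
                "free_mib": total - used,
            }
        )
    return rows


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory figures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on the training machine.")
    else:
        try:
            for gpu in read_gpu_memory():
                print(
                    f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB used, "
                    f"{gpu['free_mib']} MiB free"
                )
        except subprocess.CalledProcessError as exc:
            print(f"nvidia-smi failed: {exc}")
```

Running this in a second terminal while training starts should show whether both processes are allocating on the same card or each on its own.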

The launch script is:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 src/train_sft.py \
    --model_name_or_path /wang/wangmodels/chatglm2-6b \
    --use_v2 \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --lora_rank 8 \
    --max_source_length 128 \
    --max_target_length 128 \
    --output_dir path_to_sft_checkpoint \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16 \
    --ddp_find_unused_parameters False

Jul 10 '23 03:07 neptunear

Try using DeepSpeed.
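
A minimal sketch of what that could look like, assuming the training script accepts the Hugging Face Trainer `--deepspeed` argument (this is not confirmed in the thread). It writes a ZeRO stage 2 config file; the file name `ds_config.json` and the helper names are hypothetical, and the `"auto"` values rely on the Trainer integration filling them in from the command-line arguments.

```python
import json


def build_zero2_config():
    """Return a small DeepSpeed ZeRO stage 2 config as a plain dict."""
    return {
        # "auto" lets the Hugging Face Trainer copy these from its own arguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,  # shard optimizer states and gradients across GPUs
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }


def write_config(path="ds_config.json"):
    """Write the config to disk and return the path (file name is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_zero2_config(), f, indent=2)
    return path


if __name__ == "__main__":
    print(f"wrote {write_config()}")
```

Under that assumption, the launch command would add `--deepspeed ds_config.json` to its existing arguments. ZeRO stage 2 shards optimizer states and gradients but not the model weights, so with LoRA, where few parameters are trainable, the memory saving may be small.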

Jul 10 '23 03:07 hiyouga

How much GPU memory do you have for distributed training? I'm using 24 GB 3090s. Could the GPUs be the problem?

Jul 10 '23 03:07 neptunear

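Has this been resolved? Same here: no error on a single GPU, but the same error when using accelerate. The setup is 2 × 24 GB 3090, and it fails even with batch_size=1. The error is raised at: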

File "/root/miniconda3/envs/llm/lib/python3.8/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling "cublasCreate(handle)"

The training script is:

accelerate launch src/train_sft.py \
    --do_train \
    --dataset adgen_train \
    --finetuning_type lora \
    --output_dir adgen_lora \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 2000 \
    --learning_rate 1e-3 \
    --num_train_epochs 2.0 \
    --lora_rank 4 \
    --max_source_length 128 \
    --max_target_length 128 \
    --ddp_find_unused_parameters False \
    --source_prefix 你现在是一名销售员，根据以下商品标签生成一段有吸引力的商品广告词。 \
    --plot_loss \
    --fp16

When configuring accelerate, I also set DeepSpeed stage=2, and it fails with the same error.

Jul 12 '23 06:07 BeerTai

+1

Jul 27 '23 11:07 liuyijiang1994

Has this been resolved? +1

Aug 07 '23 09:08 lileilai
