After training an RM with qwen72B, prediction fails with a CUDA out-of-memory error.
Error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.63 GiB (GPU 1; 79.32 GiB total capacity; 55.82 GiB already allocated; 7.58 GiB free; 69.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
Wed Jul 17 20:42:18 2024[1,1]
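For reference, the error message itself suggests tuning `max_split_size_mb`, which is set through the `PYTORCH_CUDA_ALLOC_CONF` environment variable before launch. A minimal sketch (the 128 MiB value is only a guess, not something validated on my setup):

```shell
# Apply the allocator hint from the OOM message before launching torchrun.
# 128 is a hypothetical starting value; smaller values reduce fragmentation
# at some cost in allocation speed.
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128"
echo "$PYTORCH_CUDA_ALLOC_CONF"
```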
The batch size and DeepSpeed parameters are the same as in training, and training itself runs fine. Could anyone advise which parameter settings might be wrong?
The inference script is as follows:
#!/bin/bash
set -eux

MASTER_ADDR=$(echo $PADDLE_TRAINERS | cut -d',' -f1)
MASTER_PORT=13478
model_name="Qwen1.5-0.5B"
model_name="Qwen2-72B"
model_name_or_path='/root/paddlejob/workspace/env_run/LLaMA-Factory-0.7.0/rm_checkpoints/Qwen2-72B___rm_solo'
train_dataset=rm_eval35
OMPI_COMM_WORLD_LOCAL_RANK=$PADDLE_TRAINER_ID
# for stage3 offload_optimizer, which requires ninja.
PATH="${PATH}:/opt/conda/envs/llama_factory/bin"
/opt/conda/envs/llama_factory/bin/torchrun \
    --nproc_per_node 8 \
    --nnodes $PADDLE_TRAINERS_NUM \
    --node_rank $PADDLE_TRAINER_ID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    ./src/train_bash.py \
    --cutoff_len 4096 \
    --packing True \
    --stage rm \
    --template qwen \
    --finetuning_type full \
    --num_train_epochs 1.0 \
    --plot_loss \
    --bf16 \
    --overwrite_output_dir \
    --report_to "none" \
    --do_predict \
    --model_name_or_path ${model_name_or_path} \
    --dataset ${train_dataset} \
    --output_dir ./result/${model_name}___${train_dataset} \
    --deepspeed examples/deepspeed/ds_z3_offload_config.json \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5
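One thing I noticed while re-reading the script: it only sets `--per_device_train_batch_size`, but with `--do_predict` the HF Trainer batches by `per_device_eval_batch_size`, which defaults to 8 in transformers' TrainingArguments. So prediction may effectively run a much larger batch than training did. A sketch of pinning it explicitly (whether this alone resolves the OOM is untested on my side):

```shell
# Hypothetical change to the torchrun arguments above: cap the
# eval/predict batch size, which otherwise defaults to 8.
    --per_device_eval_batch_size 1 \
    --per_device_train_batch_size 1 \
```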