Chinese-LLaMA-Alpaca: torch.cuda.OutOfMemoryError (GPU out of memory) during instruction fine-tuning

Detailed description of the problem

Screenshots or logs

Single GPU with 24 GB of memory
GPU monitoring while the command runs
Error log

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.05 GiB (GPU 0; 23.65 GiB total capacity; 20.02 GiB already allocated; 1.82 GiB free; 20.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 441809) of binary: /root/miniconda3/envs/llama/bin/python
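
The error message suggests max_split_size_mb in case of fragmentation. A minimal sketch for checking whether fragmentation is the likely cause, assuming the figures in the message reflect the state at the moment of failure: it parses the sizes from the log line and compares the request with the best case, that is, free device memory plus whatever PyTorch has reserved but not allocated. The helper names (parse_oom, best_case_available) are made up for illustration and are not part of PyTorch.

```python
import re

# Conversion factors to GiB for the units that appear in the message.
_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}

# One regular expression per figure reported in the OOM message.
_FIELDS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved in total",
}


def parse_oom(message):
    """Return the sizes (in GiB) reported in a PyTorch CUDA OOM message."""
    sizes = {}
    for name, pattern in _FIELDS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {name!r} in the message")
        sizes[name] = float(match.group(1)) * _TO_GIB[match.group(2)]
    return sizes


def best_case_available(sizes):
    """Free device memory plus memory PyTorch has cached but is not using."""
    return sizes["free"] + (sizes["reserved"] - sizes["allocated"])


# The log line from this issue, shortened to the part that carries the numbers.
LOG_LINE = (
    "CUDA out of memory. Tried to allocate 3.05 GiB (GPU 0; 23.65 GiB total "
    "capacity; 20.02 GiB already allocated; 1.82 GiB free; 20.80 GiB reserved "
    "in total by PyTorch)"
)

if __name__ == "__main__":
    sizes = parse_oom(LOG_LINE)
    print(f"requested:           {sizes['requested']:.2f} GiB")
    print(f"best case available: {best_case_available(sizes):.2f} GiB")
```

For this log the best case comes to about 2.60 GiB against a 3.05 GiB request, so the run looks genuinely short of memory rather than merely fragmented, and tuning the allocator alone would probably not be enough.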

Bash script

######## Parameters ########
lr=1e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/data/chinese-llama/llama-7B
chinese_tokenizer_path=/data/chinese-llama/llama-7B
dataset_dir=/root/SourceCode/DL/Chinese-LLaMA-Alpaca/data
per_device_train_batch_size=1
per_device_eval_batch_size=1
training_steps=100
gradient_accumulation_steps=1
output_dir=/data/chinese-llama/chinese_alpaca_plus_lora_7b_gitlab
peft_model=/data/chinese-llama/chinese_alpaca_plus_lora_7b
validation_file=/root/SourceCode/DL/Chinese-LLaMA-Alpaca/scripts/validation_file_name/vaidation.json

deepspeed_config_file=ds_zero2_no_offload.json

######## Launch command ########
torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 250 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 1 \
    --max_seq_length 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --peft_path ${peft_model} \
    --ddp_find_unused_parameters False

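I have already removed the arguments --modules_to_save ${modules_to_save} and --gradient_checkpointing from the command.

Is 24 GB of GPU memory not enough for training? What is the minimum amount of GPU memory required?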

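Required checks (for the first three items, keep only the ones that apply)

[x] Base model: LLaMA-Plus
[x] Operating system: Linux
[x] Issue category: model training and fine-tuning
[x] (Required) Since the dependencies are updated frequently, please make sure you have followed the steps in the Wiki
[x] (Required) I have read the FAQ and searched the existing issues, and found no similar problem or solution
[x] (Required) Third-party plugin issues, e.g. llama.cpp, text-generation-webui, LlamaChat: we also recommend looking for a solution in the corresponding project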

May 22 '23 14:05 huruizhi

Does this mean that at least 28 GB of GPU memory is required?

May 22 '23 16:05 huruizhi

In theory, with gradient_checkpointing enabled, 24 GB of GPU memory should be enough for training.

May 23 '23 04:05 iMountTai

> In theory, with gradient_checkpointing enabled, 24 GB of GPU memory should be enough for training.

I'll try again.

I tried it and it still fails with OOM.

May 23 '23 07:05 huruizhi

@huruizhi Did you solve this? I set --nproc_per_node to 2 and used two 24 GB GPUs, and I still get OOM.

May 23 '23 08:05 LiemLin

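> @huruizhi Did you solve this? I set --nproc_per_node to 2 and used two 24 GB GPUs, and I still get OOM.

Not solved. I have tried all sorts of approaches and still get OOM; the GPU memory always falls just slightly short.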

May 23 '23 10:05 huruizhi

It is short by just this little bit. Is PyTorch the cause?

May 23 '23 10:05 huruizhi

Don't just remove the modules_to_save argument; you also need to set modules_to_save=None in the code, because the code defines a default value for it.

May 23 '23 11:05 Q4n

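> Don't just remove the modules_to_save argument; you also need to set modules_to_save=None in the code, because the code defines a default value for it.

The default in the code already seems to be None. Have you solved it on your side? How much hardware is needed to train with the default parameter configuration?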

May 23 '23 11:05 PL2584718785

> It is short by just this little bit. Is PyTorch the cause?

Hello, have you solved this? Roughly how much GPU memory is needed for training?

May 23 '23 11:05 PL2584718785

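>> Don't just remove the modules_to_save argument; you also need to set modules_to_save=None in the code, because the code defines a default value for it.

> The default in the code already seems to be None. Have you solved it on your side? How much hardware is needed to train with the default parameter configuration?

Oh, you're right. Last week's commit changed that and my version was out of date, sorry.

Training the 7B Plus model with 24 GB of GPU memory seems to work fine for me.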

May 23 '23 12:05 Q4n

May 23 '23 15:05 huruizhi

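>> It is short by just this little bit. Is PyTorch the cause?

> Hello, have you solved this? Roughly how much GPU memory is needed for training?

In my tests it needs roughly 30 GB.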

May 23 '23 18:05 huruizhi

@huruizhi Do you know how two 24 GB GPUs can share the 30 GB memory requirement? I set --nproc_per_node to 2 and still get OOM.

May 24 '23 01:05 LiemLin
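
One possible explanation for why a second GPU did not help: the script uses a ZeRO stage 2 config (ds_zero2_no_offload.json), and as the DeepSpeed documentation describes the stages, stage 2 partitions optimizer states and gradients across GPUs but leaves a full copy of the weights on each one. The sketch below estimates per-GPU memory for model states only. It assumes the common rule of thumb for mixed-precision Adam (2 bytes of fp16 weights for every parameter, plus 2 bytes of fp16 gradients and 12 bytes of fp32 optimizer state for each trainable parameter). The parameter counts in the example are hypothetical round figures, not measurements from this run, and activations are ignored entirely.

```python
GIB = 1024 ** 3


def model_state_gib(frozen_params, trainable_params, num_gpus, zero_stage):
    """Rough per-GPU memory (GiB) for weights, gradients and Adam states.

    Activations, temporary buffers and allocator overhead are not included.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2 or 3")
    weights = 2 * (frozen_params + trainable_params)  # fp16 copy of every weight
    grads = 2 * trainable_params                      # fp16 gradients
    optim = 12 * trainable_params                     # fp32 master weights + two Adam moments
    if zero_stage >= 1:
        optim /= num_gpus      # stage 1 shards the optimizer states
    if zero_stage >= 2:
        grads /= num_gpus      # stage 2 also shards the gradients
    if zero_stage >= 3:
        weights /= num_gpus    # only stage 3 shards the weights themselves
    return (weights + grads + optim) / GIB


if __name__ == "__main__":
    # Hypothetical counts: about 6.9e9 frozen and 0.4e9 trainable parameters.
    frozen, trainable = 6.9e9, 0.4e9
    for gpus, stage in [(1, 2), (2, 2), (2, 3)]:
        gib = model_state_gib(frozen, trainable, gpus, stage)
        print(f"{gpus} GPU(s), ZeRO-{stage}: {gib:.1f} GiB per GPU")
```

Under these assumptions, going from one GPU to two with stage 2 saves only part of the gradient and optimizer memory, because the fp16 weights are still replicated on each GPU. Whether stage 3 or CPU offload would fit this particular run is not something the thread confirms.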

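@Q4n

>>> Don't just remove the modules_to_save argument; you also need to set modules_to_save=None in the code, because the code defines a default value for it.

>> The default in the code already seems to be None. Have you solved it on your side? How much hardware is needed to train with the default parameter configuration?

> Oh, you're right. Last week's commit changed that and my version was out of date, sorry.

> Training the 7B Plus model with 24 GB of GPU memory seems to work fine for me.

Hi, I set modules_to_save=None and still get OOM. Is there anything else that needs to be changed?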

May 24 '23 05:05 LiemLin

It seems to be related to the length of the training sentences.

May 26 '23 05:05 xinyu1905
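
To illustrate why sentence length could matter, here is a minimal sketch of how one kind of activation scales with sequence length. It counts only the attention score matrices, assuming the LLaMA-7B shape (32 layers, 32 heads), fp16 values, a standard attention implementation that materializes the full score matrix, and gradient checkpointing turned off so that every layer keeps its scores for the backward pass. Real usage includes many other activations, so treat the result as a partial lower bound rather than a prediction.

```python
GIB = 1024 ** 3


def attention_scores_gib(seq_len, batch_size=1, num_layers=32, num_heads=32,
                         bytes_per_value=2):
    """Memory (GiB) of the seq_len x seq_len attention scores over all layers."""
    values = batch_size * num_layers * num_heads * seq_len * seq_len
    return values * bytes_per_value / GIB


if __name__ == "__main__":
    for seq_len in (256, 512, 1024, 2048):
        print(f"seq_len={seq_len:5d}: {attention_scores_gib(seq_len):.3f} GiB")
```

With these assumptions the scores take 0.5 GiB at the --max_seq_length 512 used in the script and four times that at 1024, since the cost grows with the square of the sequence length. Lowering --max_seq_length, or filtering out very long samples, is therefore one knob worth trying.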

@huruizhi Did you solve it? I couldn't get it running on a V100 either.

May 27 '23 02:05 jiaohuix

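When peft_model is loaded, embed and lm_head are updated during training as well. You could try fine-tuning the merged model directly instead, and then adjust the trainable parameters, gradient checkpointing, batch_size, and so on.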

May 27 '23 03:05 iMountTai
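
A rough sketch of why training embed_tokens and lm_head is so much heavier than training the LoRA adapters alone. It counts parameters using the LLaMA-7B shape (hidden size 4096, MLP size 11008, 32 layers) together with lora_rank=8 and the seven target modules from the script above. The vocabulary size of 49954 is an assumption about the extended Chinese-Alpaca tokenizer, and the 16 bytes per trainable parameter is the usual mixed-precision Adam rule of thumb, not a figure measured in this run.

```python
GIB = 1024 ** 3


def embed_and_head_params(vocab_size, hidden_size):
    """Parameters in embed_tokens plus lm_head (each is vocab_size x hidden_size)."""
    return 2 * vocab_size * hidden_size


def lora_params(rank, hidden_size, mlp_size, num_layers):
    """LoRA parameters for q/k/v/o_proj and gate/up/down_proj in every layer.

    A LoRA adapter on an (in_dim x out_dim) linear layer adds rank * (in_dim + out_dim).
    """
    attention = 4 * rank * (hidden_size + hidden_size)  # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * rank * (hidden_size + mlp_size)           # gate_proj, up_proj, down_proj
    return num_layers * (attention + mlp)


def training_state_gib(num_params, bytes_per_param=16):
    """fp16 weight + fp16 gradient + fp32 master copy and two Adam moments."""
    return num_params * bytes_per_param / GIB


if __name__ == "__main__":
    full = embed_and_head_params(vocab_size=49954, hidden_size=4096)  # vocab size is assumed
    lora = lora_params(rank=8, hidden_size=4096, mlp_size=11008, num_layers=32)
    print(f"embed_tokens + lm_head: {full:,} params, {training_state_gib(full):.2f} GiB")
    print(f"LoRA adapters:          {lora:,} params, {training_state_gib(lora):.2f} GiB")
```

Under these assumptions the two full matrices hold roughly twenty times as many trainable parameters as all the LoRA adapters combined, which is consistent with the suggestion to fine-tune the merged model and keep embed_tokens and lm_head frozen when memory is tight.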

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

Jun 03 '23 22:06 github-actions[bot]

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

Jun 06 '23 22:06 github-actions[bot]

@jiaohuix Did you ever solve this? I can't get it running on a single V100 either.

Jun 27 '23 02:06 wuzixiaoer

@wuzixiaoer No, I haven't.

Jun 27 '23 05:06 jiaohuix

> @wuzixiaoer No, I haven't.

@jiaohuix How much memory does your V100 have?

Jul 14 '23 08:07 coolge

32 GB

Jul 14 '23 08:07 jiaohuix

@jiaohuix Have you solved this out-of-memory problem?

Jul 14 '23 08:07 coolge

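@Q4n Which model are you using, Chinese-Alpaca or Chinese-LLaMA? Chinese-Alpaca requires specifying the LoRA weights directory (--peft_path), which takes up a lot of GPU memory; I can't get it running on 24 GB.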

Jul 14 '23 09:07 coolge

Has this been solved? I ran into the same problem:

Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 23.65 GiB total capacity; 22.92 GiB already allocated; 182.06 MiB free; 22.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Nov 29 '23 03:11 binsson
