export_llama_to_onnx

Exporting QWen-7b on a 3090 fails with an OOM error.

Open · linthy94 opened this issue on Oct 24, 2023 · 3 comments

begin export qwen
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Traceback (most recent call last):
  File "export_qwen_naive.py", line 175, in <module>
    export_qwen(args)
  File "export_qwen_naive.py", line 151, in export_qwen
    export_qwen_to_single_onnx(model, config, dtype, args, "qwen_onnx")
  File "export_qwen_naive.py", line 109, in export_qwen_to_single_onnx
    torch.onnx.export(

……………………

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 23.32 GiB already allocated; 5.88 MiB free; 23.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

linthy94 · Oct 24, 2023

I'm using a single 3090, and the exported model is fp16.

linthy94 · Oct 24, 2023

I'm using a single 3090, and the exported model is fp16.

You could consider: 1. exporting on the CPU, 2. using a GPU with more memory, or 3. modifying the code to split the model into multiple sub-models (see the sketch below).
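A minimal sketch of idea 3, using a hypothetical stand-in block (TinyBlock) rather than the real QWen modules, whose decoder blocks take extra inputs such as attention masks and rotary embeddings. The point is that only one slice of the model has to sit in GPU memory while torch.onnx.export traces it:

# Hypothetical illustration of splitting the export into sub-models.
# TinyBlock stands in for one decoder layer; real QWen blocks would need
# additional dummy inputs (attention mask, rotary embeddings, KV cache).
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states):
        return self.proj(hidden_states)

blocks = nn.ModuleList(TinyBlock() for _ in range(4))
dummy = torch.randn(1, 8, 128)

# One ONNX file per block (or per small pack of blocks), so only that
# slice needs to fit on the GPU at export time.
for i, block in enumerate(blocks):
    torch.onnx.export(
        block, (dummy,), f"decoder_block_{i}.onnx",
        input_names=["hidden_states"], output_names=["output"],
        opset_version=15,
    )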

luchangli03 · Oct 30, 2023

An fp16 model cannot be exported on the CPU. Instead, you can pass device as auto and then change the device of the inputs inside the script; this lets you export the ONNX across multiple GPUs and avoids the model not fitting on a single card:

from transformers.modeling_utils import get_parameter_device

# put each dummy input on the same device as the sub-module being exported
device = get_parameter_device(lm_head_model)
input_data = torch.randn(input_shape, dtype=dtype).to(device)

There are several other places in the script that need the same kind of change; you can adapt them following the pattern above.
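For reference, a minimal sketch of the multi-GPU setup described above; the checkpoint name and the use of lm_head as the exported sub-module are purely illustrative:

import torch
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import get_parameter_device

# device_map="auto" shards the fp16 weights across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",          # illustrative checkpoint name
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

lm_head_model = model.lm_head     # example sub-module to export
device = get_parameter_device(lm_head_model)

# build the dummy input on the device where that sub-module's weights live
input_data = torch.randn(1, 1, model.config.hidden_size,
                         dtype=torch.float16).to(device)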

Also, packing too many decoder layers into a single ONNX graph can itself cause OOM, so try reducing the number of decoder layers merged per file:

python export_llama.py -m ../llama/llama-2-7b-chat-hf/ -o ./llama-2-7b-chat/ -d auto --decoder_pack_size 1

aksenventwo · Dec 06, 2023