export_llama_to_onnx

Exporting QWen-7b on a 3090 fails with an OOM error.

Open · linthy94 opened this issue on Oct 24, 2023 · 3 comments

begin export qwen
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Traceback (most recent call last):
  File "export_qwen_naive.py", line 175, in <module>
    export_qwen(args)
  File "export_qwen_naive.py", line 151, in export_qwen
    export_qwen_to_single_onnx(model, config, dtype, args, "qwen_onnx")
  File "export_qwen_naive.py", line 109, in export_qwen_to_single_onnx
    torch.onnx.export(

……………………

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 23.32 GiB already allocated; 5.88 MiB free; 23.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

linthy94 · Oct 24, 2023

I'm using a single 3090, and the exported model is fp16.

linthy94 · Oct 24, 2023

I'm using a single 3090, and the exported model is fp16.

You could consider: 1. exporting on the CPU, 2. using a GPU with more memory, or 3. modifying the code to split the model into multiple sub-models (see the sketch below).
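A minimal sketch of idea 3, using a hypothetical stand-in block (TinyBlock) rather than the real QWen modules, whose decoder blocks take extra inputs such as attention masks and rotary embeddings. The point is that only one slice of the model has to sit in GPU memory while torch.onnx.export traces it:

# Hypothetical illustration of splitting the export into sub-models.
# TinyBlock stands in for one decoder layer; real QWen blocks would need
# additional dummy inputs (attention mask, rotary embeddings, KV cache).
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states):
        return self.proj(hidden_states)

blocks = nn.ModuleList(TinyBlock() for _ in range(4))
dummy = torch.randn(1, 8, 128)

# One ONNX file per block (or per small pack of blocks), so only that
# slice needs to fit on the GPU at export time.
for i, block in enumerate(blocks):
    torch.onnx.export(
        block, (dummy,), f"decoder_block_{i}.onnx",
        input_names=["hidden_states"], output_names=["output"],
        opset_version=15,
    )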

luchangli03 · Oct 30, 2023

An fp16 model cannot be exported on the CPU. Instead, you can pass device as auto and then change the device of the inputs inside the script; this lets you export the ONNX across multiple GPUs and avoids the model not fitting on a single card:

from transformers.modeling_utils import get_parameter_device

# put each dummy input on the same device as the sub-module being exported
device = get_parameter_device(lm_head_model)
input_data = torch.randn(input_shape, dtype=dtype).to(device)

There are several other places in the script that need the same kind of change; you can adapt them following the pattern above.
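For reference, a minimal sketch of the multi-GPU setup described above; the checkpoint name and the use of lm_head as the exported sub-module are purely illustrative:

import torch
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import get_parameter_device

# device_map="auto" shards the fp16 weights across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",          # illustrative checkpoint name
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

lm_head_model = model.lm_head     # example sub-module to export
device = get_parameter_device(lm_head_model)

# build the dummy input on the device where that sub-module's weights live
input_data = torch.randn(1, 1, model.config.hidden_size,
                         dtype=torch.float16).to(device)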

Also, packing too many decoder layers into a single ONNX graph can itself cause OOM, so try reducing the number of decoder layers merged per file:

python export_llama.py -m ../llama/llama-2-7b-chat-hf/ -o ./llama-2-7b-chat/ -d auto --decoder_pack_size 1

aksenventwo · Dec 06, 2023