A10 can't convert Qwen1.5-7B model?
```
root@a2968a9db901:/TensorRT-LLM/examples/qwen# python3 convert_checkpoint.py --qwen_type qwen2 --model_dir ./Qwen1.5-7B-Chat --output_dir ./Qwen1.5-7B-Chat-c-model
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
0.10.0.dev2024043000
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.21it/s]
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 365, in <module>
    main()
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 357, in main
    convert_and_save_hf(args)
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 319, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 325, in execute
    f(args, rank)
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 305, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1086, in from_hugging_face
    weights = load_weights_from_hf(config=config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1192, in load_weights_from_hf
    weights = convert_hf_qwen(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 892, in convert_hf_qwen
    get_tllm_linear_weight(split_v, tllm_prex + 'mlp.proj.', None,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 497, in get_tllm_linear_weight
    results[prefix + postfix] = weight.clone()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 18.62 MiB is free. Process 2753675 has 1.67 GiB memory in use. Process 2764636 has 20.28 GiB memory in use. Of the allocated memory 19.79 GiB is allocated by PyTorch, and 232.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
How can I resolve this? The old version let me set the max input/output tokens.
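As an aside, the OOM message itself quotes one mitigation, sketched below. It only helps when the failure comes from allocator fragmentation (a large "reserved but unallocated" pool), not when the GPU is simply full, as appears to be the case here:

```bash
# Fragmentation workaround quoted in the PyTorch OOM message above.
# Unlikely to help in this case, since only ~18 MiB of the GPU was free.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 convert_checkpoint.py \
    --qwen_type qwen2 \
    --model_dir ./Qwen1.5-7B-Chat \
    --output_dir ./Qwen1.5-7B-Chat-c-model
```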
Could you try setting `--load_model_on_cpu`?
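For reference, the command from the top of this thread with that flag added would look like the following; the flag keeps the Hugging Face weights in host RAM instead of GPU memory during conversion (all other arguments unchanged):

```bash
# Same conversion command as above, with --load_model_on_cpu added so the
# source weights are loaded into host RAM rather than onto the A10.
python3 convert_checkpoint.py \
    --qwen_type qwen2 \
    --model_dir ./Qwen1.5-7B-Chat \
    --output_dir ./Qwen1.5-7B-Chat-c-model \
    --load_model_on_cpu
```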
I think it will work. I changed the device mapping to CPU and the conversion now runs fine.
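For anyone hitting the same error: what this amounts to is loading the Hugging Face checkpoint on the CPU rather than the GPU. A minimal sketch of that change, assuming the local model path from this thread and the standard transformers API:

```python
# Minimal sketch of the workaround described above: load the Hugging Face
# checkpoint with device_map="cpu" so the ~15 GB of fp16 weights stay in
# host RAM instead of filling the A10's ~22 GiB of VRAM during conversion.
# The local path ./Qwen1.5-7B-Chat is the one used in this thread;
# device_map requires the `accelerate` package to be installed.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./Qwen1.5-7B-Chat",
    device_map="cpu",    # keep all weights on the CPU
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)
```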