A10 can't convert Qwen1.5-7B model?
```
root@a2968a9db901:/TensorRT-LLM/examples/qwen# python3 convert_checkpoint.py --qwen_type qwen2 --model_dir ./Qwen1.5-7B-Chat --output_dir ./Qwen1.5-7B-Chat-c-model
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
0.10.0.dev2024043000
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.21it/s]
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 365, in <module>
    main()
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 357, in main
    convert_and_save_hf(args)
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 319, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 325, in execute
    f(args, rank)
  File "/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 305, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1086, in from_hugging_face
    weights = load_weights_from_hf(config=config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1192, in load_weights_from_hf
    weights = convert_hf_qwen(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 892, in convert_hf_qwen
    get_tllm_linear_weight(split_v, tllm_prex + 'mlp.proj.', None,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 497, in get_tllm_linear_weight
    results[prefix + postfix] = weight.clone()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 18.62 MiB is free. Process 2753675 has 1.67 GiB memory in use. Process 2764636 has 20.28 GiB memory in use. Of the allocated memory 19.79 GiB is allocated by PyTorch, and 232.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
How can I resolve this? The old version let me set the max input/output tokens.
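As an aside, the OOM message itself quotes one mitigation, sketched below. It only helps when the failure comes from allocator fragmentation (a large "reserved but unallocated" pool), not when the GPU is simply full, as appears to be the case here:

```bash
# Fragmentation workaround quoted in the PyTorch OOM message above.
# Unlikely to help in this case, since only ~18 MiB of the GPU was free.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 convert_checkpoint.py \
    --qwen_type qwen2 \
    --model_dir ./Qwen1.5-7B-Chat \
    --output_dir ./Qwen1.5-7B-Chat-c-model
```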
Could you try setting `--load_model_on_cpu`?
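For reference, the command from the top of this thread with that flag added would look like the following; the flag keeps the Hugging Face weights in host RAM instead of GPU memory during conversion (all other arguments unchanged):

```bash
# Same conversion command as above, with --load_model_on_cpu added so the
# source weights are loaded into host RAM rather than onto the A10.
python3 convert_checkpoint.py \
    --qwen_type qwen2 \
    --model_dir ./Qwen1.5-7B-Chat \
    --output_dir ./Qwen1.5-7B-Chat-c-model \
    --load_model_on_cpu
```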
I think it will work. I changed the device mapping to CPU and the conversion now runs fine.
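For anyone hitting the same error: what this amounts to is loading the Hugging Face checkpoint on the CPU rather than the GPU. A minimal sketch of that change, assuming the local model path from this thread and the standard transformers API:

```python
# Minimal sketch of the workaround described above: load the Hugging Face
# checkpoint with device_map="cpu" so the ~15 GB of fp16 weights stay in
# host RAM instead of filling the A10's ~22 GiB of VRAM during conversion.
# The local path ./Qwen1.5-7B-Chat is the one used in this thread;
# device_map requires the `accelerate` package to be installed.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./Qwen1.5-7B-Chat",
    device_map="cpu",    # keep all weights on the CPU
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)
```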