Qwen2-VL cannot be converted to a checkpoint with TensorRT-LLM
System Info
- CPU: x86
- GPU: 2xL40S
- Memory: 256GB
- System: Ubuntu 22.04
- Docker Image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
- TensorRT-LLM version: 0.16.0
Who can help?
I have tested the examples under `examples/multimodal`. But when I try to convert Qwen2-VL-7B to a checkpoint via

```
python3 ../qwen/convert_checkpoint.py --model_dir Qwen2-VL-7B-Instruct \
    --output_dir trt_models/Qwen2-VL-7B-Instruct/fp16/1-gpu \
    --dtype float16
```

I get the error `Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}`, so it seems Qwen2-VL is not supported. Is this due to the Docker image I used, or do I have to build TensorRT-LLM from source?
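For context, the key that trips the converter comes from the model's own config file. A minimal way to look at it (a sketch; adjust the model path to wherever the HF weights live, and the exact dict printed depends on your transformers version):

```python
# Inspect the rope_scaling block that Qwen2-VL ships in its config.json.
# The 'mrope_section' entry is what older converters / transformers builds
# do not recognize, which produces the warning quoted above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen2-VL-7B-Instruct")
print(cfg.rope_scaling)
# Expected to print something like {'type': 'mrope', 'mrope_section': [...], ...}
```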
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- cd to `examples/multimodal`
- Run:

```
python3 ../qwen/convert_checkpoint.py --model_dir Qwen2-VL-7B-Instruct \
    --output_dir trt_models/Qwen2-VL-7B-Instruct/fp16/1-gpu \
    --dtype float16
```
Expected behavior
The checkpoint is written to trt_models/Qwen2-VL-7B-Instruct/fp16/1-gpu without any errors.
Actual behavior
Got error log:
```
root@04292e29d243:/workspace/TensorRT-LLM/examples/multimodal# python3 ../qwen/convert_checkpoint.py --model_dir Qwen2-VL-7B-Instruct \
    --output_dir trt_models/Qwen2-VL-7B-Instruct/fp16/1-gpu \
    --dtype float16
2025-01-03 11:20:24.426668: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-01-03 11:20:24.441389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1735903224.456763    2272 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1735903224.461320    2272 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-03 11:20:24.477010: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
0.16.0
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/functional.py", line 656, in from_string
    return RotaryScalingType[s]
           ~~~~~~~~~~~~~~~~~^^^
  File "/usr/lib/python3.12/enum.py", line 814, in __getitem__
    return cls._member_map_[name]
           ~~~~~~~~~~~~~~~~^^^^^^
KeyError: 'default'
```
Additional notes
I have tried Phi-3-vision and Qwen2-7B-Instruct as well; both of them work.
@sunnyqgg, would you please take a look at this issue?
Hi, please use the latest main code and run `pip install -r requirements-qwen2vl.txt` first.
Thanks.
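For anyone following along, here is a quick way to confirm the environment actually picked up the newer packages after reinstalling (a minimal sketch; the exact version strings depend on the main-branch build you made):

```python
# Sanity-check the installed versions and the Qwen2-VL-specific dependency
# that the example code imports (qwen_vl_utils).
import tensorrt_llm
import transformers

print("tensorrt_llm:", tensorrt_llm.__version__)
print("transformers:", transformers.__version__)

from qwen_vl_utils import process_vision_info  # should import without error
```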
I rebuilt the Docker image with the latest source code on the main branch. The checkpoint conversion is now fixed for Qwen2-VL. However, run.py still does not work for Qwen2-VL.
I have tried:

```
python run.py \
    --hf_model_dir Qwen2-VL-7B-Instruct \
    --visual_engine_dir trt_engines/Qwen2-VL-7B-Instruct/vision_encoder \
    --llm_engine_dir trt_engines/Qwen2-VL-7B-Instruct/fp16/1-gpu/ \
    --image_path=merlion.png
```
But got:

```
root@00d9a1ccd86f:/workspace/TensorRT-LLM/examples/multimodal# python run.py \
    --hf_model_dir Qwen2-VL-7B-Instruct \
    --visual_engine_dir trt_engines/Qwen2-VL-7B-Instruct/vision_encoder \
    --llm_engine_dir trt_engines/Qwen2-VL-7B-Instruct/fp16/1-gpu/ \
    --image_path=merlion.png
2025-01-10 08:19:36.099445: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-01-10 08:19:36.114432: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1736497176.130732   10771 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736497176.135485   10771 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-10 08:19:36.152056: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM][INFO] Engine version 0.17.0.dev2024121700 found in the config file, assuming engine(s) built by new builder API.
[01/10/2025-08:19:39] [TRT-LLM] [I] Loading engine from trt_engines/Qwen2-VL-7B-Instruct/vision_encoder/model.engine
[01/10/2025-08:19:39] [TRT-LLM] [I] Creating session from engine trt_engines/Qwen2-VL-7B-Instruct/vision_encoder/model.engine
[01/10/2025-08:19:39] [TRT] [I] Loaded engine size: 1303 MiB
[01/10/2025-08:19:40] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +498, now: CPU 0, GPU 1791 (MiB)
[01/10/2025-08:19:40] [TRT-LLM] [I] Running LLM with C++ runner
[TensorRT-LLM][INFO] Engine version 0.17.0.dev2024121700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.17.0.dev2024121700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3072) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 3071 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 14549 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1000.03 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 16332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 11.49 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.72 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 44.52 GiB, available: 27.02 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 7116
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 48
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 24.32 GiB for max tokens in paged KV cache (455424).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[01/10/2025-08:19:51] [TRT-LLM] [I] Load engine takes: 10.98725938796997 sec
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/multimodal/run.py", line 88, in <module>
    input_text, output_text = model.run(args.input_text, raw_image,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/runtime/multimodal_model_runner.py", line 1989, in run
    input_text, pre_prompt, post_prompt, processed_image, decoder_input_ids, other_vision_inputs, other_decoder_inputs = self.setup_inputs(
                                                                                                                         ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/runtime/multimodal_model_runner.py", line 1728, in setup_inputs
    processor.apply_chat_template(msg,
    ^^^^^^^^^
NameError: name 'processor' is not defined. Did you mean: 'self.processor'?
[TensorRT-LLM][INFO] Refreshed the MPI local session.
```
Any suggestion here? Thanks!
The issue happens here:
```
[01/10/2025-08:19:51] [TRT-LLM] [I] Load engine takes: 10.98725938796997 sec
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/multimodal/run.py", line 88, in <module>
    input_text, output_text = model.run(args.input_text, raw_image,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/runtime/multimodal_model_runner.py", line 1989, in run
    input_text, pre_prompt, post_prompt, processed_image, decoder_input_ids, other_vision_inputs, other_decoder_inputs = self.setup_inputs(
                                                                                                                         ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/runtime/multimodal_model_runner.py", line 1728, in setup_inputs
    processor.apply_chat_template(msg,
    ^^^^^^^^^
NameError: name 'processor' is not defined. Did you mean: 'self.processor'?
```
Hi @xunuohope1107,
Please add `processor = AutoProcessor.from_pretrained(self.args.hf_model_dir)` in `tensorrt_llm/runtime/multimodal_model_runner.py`.
Thanks.
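For readers hitting the same NameError: the suggested line presumably belongs in setup_inputs, before the apply_chat_template call that the traceback points at. A standalone sketch of what the fix does (the model path is a placeholder; inside multimodal_model_runner.py the equivalent would use self.args.hf_model_dir):

```python
# Build the Qwen2-VL processor from the HF model directory and use it for
# chat templating, which is what the failing `processor` reference expects.
from transformers import AutoProcessor

hf_model_dir = "Qwen2-VL-7B-Instruct"  # placeholder path
processor = AutoProcessor.from_pretrained(hf_model_dir)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
text = processor.apply_chat_template(messages,
                                     tokenize=False,
                                     add_generation_prompt=True)
print(text)
```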
Yeah, I have checked `tensorrt_llm/runtime/multimodal_model_runner.py`, but it already has:

```python
if self.model_type == "qwen2_vl":
    hf_config = AutoConfig.from_pretrained(self.args.hf_model_dir)
    self.vision_start_token_id = hf_config.vision_start_token_id
    self.vision_end_token_id = hf_config.vision_end_token_id
    self.vision_token_id = hf_config.vision_token_id
    self.image_token_id = hf_config.image_token_id
    self.video_token_id = hf_config.video_token_id
    self.spatial_merge_size = hf_config.vision_config.spatial_merge_size
    self.max_position_embeddings = hf_config.max_position_embeddings
    self.hidden_size = hf_config.hidden_size
    self.num_attention_heads = hf_config.num_attention_heads
    self.rope_theta = hf_config.rope_theta
```
Do you mean modifying the code like this:

```python
elif 'qwen2_vl' in self.model_type:
    from qwen_vl_utils import process_vision_info
    from transformers.models.qwen2_vl.modeling_qwen2_vl import VisionRotaryEmbedding
    hf_config = AutoConfig.from_pretrained(self.args.hf_model_dir)
    if input_text is None:
        input_text = "Question: Describe this image. Answer:"
    messages = [[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": raw_image[idx],
            },
            {
                "type": "text",
                "text": input_text[idx],
            },
        ],
    }] for idx in range(self.args.batch_size)]
    texts = [
        hf_config.apply_chat_template(msg,
                                      tokenize=False,
                                      add_generation_prompt=True)
        for msg in messages
    ]
```

i.e., change `processor.apply_chat_template` to `hf_config.apply_chat_template`?
> Hi, please use the latest main code and run `pip install -r requirements-qwen2vl.txt` first. Thanks.
Do I still need to install from source based on the code submitted in 21fac7? Or can I use the latest version of transformers directly?
You can try editing the TensorRT-LLM source:

```
vim /usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/qwen/config.py +146
```
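Before hand-editing, it may also be worth checking whether the installed build already knows about Qwen2-VL's M-RoPE scaling type. A minimal check, assuming RotaryScalingType is importable from tensorrt_llm.functional, as the traceback in the original error log indicates:

```python
# List the rotary scaling types the installed TensorRT-LLM build understands.
# If no M-RoPE-style entry is present, checkpoint conversion for Qwen2-VL
# will fail with the KeyError shown in the original traceback.
from tensorrt_llm.functional import RotaryScalingType

print([member.name for member in RotaryScalingType])
```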
I found the root cause. As sunnyqgg mentioned in the previous discussion, edit the source code at /usr/local/lib/python3.12/dist-packages/tensorrt_llm/runtime/multimodal_model_runner.py accordingly.
> You can try editing the TensorRT-LLM source: `vim /usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/qwen/config.py +146`
This does help. Thanks!
> Hi @xunuohope1107, please add `processor = AutoProcessor.from_pretrained(self.args.hf_model_dir)` in `tensorrt_llm/runtime/multimodal_model_runner.py`. Thanks.
Hi @sunnyqgg, I see that the issue is still being reported by users; can you help file a PR to fix it? Thanks.
Hi, the issues above are all solved in the latest main code. If you still run into issues, please let me know. cc @kaiyux
This issue was closed because it has been 14 days without activity since it has been marked as stale.