TensorRT-LLM
QwenVL visual_encoder failure
System Info
```
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600
[02/16/2024-22:04:57] [TRT-LLM] [I] Loading engine from ./plan/visual_encoder/visual_encoder_fp16.plan
[02/16/2024-22:05:00] [TRT-LLM] [I] Creating session from engine ./plan/visual_encoder/visual_encoder_fp16.plan
[02/16/2024-22:05:00] [TRT] [I] Loaded engine size: 3714 MiB
[02/16/2024-22:05:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3699, now: CPU 0, GPU 3699 (MiB)
[02/16/2024-22:05:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +190, now: CPU 0, GPU 3889 (MiB)
[02/16/2024-22:05:00] [TRT] [E] 3: [executionContext.cpp::setInputShape::2278] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2278, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
Traceback (most recent call last):
  File "/home/kye/TensorRT-LLM/examples/qwenvl/run.py", line 481, in
```
I've downloaded the model and followed the instructions, and I know Qwen-VL's input resolution is 448. I'm not sure why the example run.py forces the image to 224, which is probably what causes the error.
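For what it's worth, the static input shape baked into the plan can be checked directly (a minimal sketch, assuming TensorRT >= 8.5 and the tensor name 'input' that vit_process uses below):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open('./plan/visual_encoder/visual_encoder_fp16.plan', 'rb') as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

# For a Qwen-VL visual encoder I'd expect something like (-1, 3, 448, 448)
# here; a 224 in the spatial dims would mean the engine itself was built at
# the wrong size rather than run.py feeding the wrong image.
print(engine.get_tensor_shape('input'))
```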
Who can help?
No response
Information
- [X] The official example scripts
- [x] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Follow the qwenvl example instructions to build the engines, then run run.py.
Expected behavior
The input shape should be 448. I need the image embeddings from the visual encoder that was compiled via ONNX.
Actual behavior
Shape mismatch: run.py could not set an input shape of 224 when the engine needs 448.
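One way to pin down which side is wrong: inspect one of the tensors that vit_process later torch.load()s (the path below is a placeholder for whatever --input_dir contains):

```python
import torch

# Placeholder path: one of the preprocessed image tensors run.py loads.
image = torch.load('image.pt', map_location='cpu')
print(image.shape)  # a 448-input Qwen-VL pipeline should give (1, 3, 448, 448)
```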
Additional notes
I've narrowed this down to the visual encoder in vit_process:
```python
def vit_process(image_path, engine_dir, stream):
    vit_path = os.path.join(engine_dir,
                            'visual_encoder/visual_encoder_fp16.plan')
    logger.info(f'Loading engine from {vit_path}')
    with open(vit_path, 'rb') as f:
        engine_buffer = f.read()
    logger.info(f'Creating session from engine {vit_path}')
    session_vit = Session.from_serialized_engine(engine_buffer)

    device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
    images_list = []
    for img in image_path:
        for v in img.values():
            # Each entry is a preprocessed image tensor saved to disk; its
            # spatial size is whatever the preprocessing step produced.
            image = torch.load(v)
            if image.device.type == 'cpu':
                image = image.to(device)
            images_list.append(image)
    images = torch.cat(images_list)
    batch_size = images.size(0)
    images = images.expand(batch_size, -1, -1, -1).contiguous()

    # The input shape handed to the engine comes straight from the loaded
    # tensors -- this is where the 224-vs-448 mismatch surfaces.
    visual_inputs = {'input': images.float()}
    visual_output_info = session_vit.infer_shapes(
        [TensorInfo('input', trt.DataType.FLOAT, images.shape)])
    visual_outputs = {
        t.name: torch.empty(tuple(t.shape),
                            dtype=trt_dtype_to_torch(t.dtype),
                            device='cuda')
        for t in visual_output_info
    }

    profiler.start("ViT")
    run_time = 1
    for _ in range(run_time):
        ok = session_vit.run(visual_inputs, visual_outputs, stream)
    profiler.stop("ViT")
    Vit_time = profiler.elapsed_time_in_sec("ViT") / run_time
    logger.info(f'TensorRT-LLM ViT latency: {Vit_time} sec ')
    assert ok, "Runtime execution failed for vit session"

    image_embeds = visual_outputs['output']
    return image_embeds
```
```python
if __name__ == '__main__':
    args = parse_arguments()
    stream = torch.cuda.current_stream().cuda_stream
    tensorrt_llm.logger.set_level(args.log_level)
    image_embeds = vit_process(args.input_dir, args.vit_engine_dir, stream)
    qinfer = QWenInfer(args.tokenizer_dir, args.qwen_engine_dir,
                      args.log_level, args.output_csv, args.output_npy,
                      args.num_beams)
    qinfer.qwen_model_init()
    qinfer.qwen_infer(image_embeds, args.images_path, args.input_text,
                      args.max_new_tokens, history=[])
```
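If the loaded tensors really come out at 224x224 while the engine wants 448, a stopgap (purely a sketch, not the official fix) would be to upsample the batch right after the torch.cat in vit_process, before visual_inputs is built:

```python
import torch
import torch.nn.functional as F

def resize_to_448(images: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper, not part of run.py: upsample an (N, 3, H, W)
    batch to the 448x448 input the visual encoder engine expects."""
    if images.shape[-2:] != (448, 448):
        images = F.interpolate(images.float(), size=(448, 448),
                               mode='bicubic', align_corners=False)
    return images
```

The real fix is presumably to make the preprocessing and the engine build agree on 448, but this at least confirms the diagnosis.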