
Mllama ignores input image when deployed in triton

Open mutkach opened this issue 9 months ago • 2 comments

System Info

  • CPU: x86_64
  • Memory: 128G
  • GPU: H100 80G
  • Docker: tritonserver:24.12-trtllm-python-py3
  • CUDA: 12.6
  • Driver: 535.216.01
  • TensorRT: 10.7.0
  • TensorRT-LLM: v0.16.0

Who can help?

@kaiyux @byshiue

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Steps to reproduce: follow the Mllama build and deployment scripts in multimodal.md, with the following exceptions:

  • use Visual-Instruct-11B instead of Visual-11B
  • set max_encoder_input_len to 6404 for Visual-Instruct-11B, as indicated by the TensorRT-LLM guide
  • set the batch size to 1 for testing purposes
  • check out the v0.16.0 tag of TensorRT-LLM (otherwise there are discrepancies when converting the checkpoint)
  • set cross_kv_cache_fraction to 0.5 in config.pbtxt (Triton won't start otherwise); see the sketch after the command below
  • start Triton manually with the command below
  • load the ensemble model (the end-to-end setup does not work otherwise)

The Triton command is:

tritonserver --model-repository=multimodal_ifb --model-control-mode=explicit --log-verbose=3 \
             --load-model=tensorrt_llm --load-model=multimodal_encoders --load-model=ensemble \
             --load-model=tensorrt_llm_bls --cuda-memory-pool-byte-size=0:300000000
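For completeness, this is roughly how the cross_kv_cache_fraction entry looks in the tensorrt_llm model's config.pbtxt. A minimal sketch in the usual tensorrtllm_backend parameter-block format, assuming the template placeholder is simply replaced with a concrete value:

# multimodal_ifb/tensorrt_llm/config.pbtxt -- ${cross_kv_cache_fraction}
# filled in by hand with 0.5
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}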

Expected behavior

When tested with the following command:

python3 tensorrt_llm/examples/multimodal/run.py --visual_engine_dir /tmp/mllama/trt_engines/encoder/ \
                                   --visual_engine_name visual_encoder.engine \
                                   --llm_engine_dir /tmp/mllama/trt_engines/decoder/ \
                                   --hf_model_dir Llama-3.2-11B-Vision-Instruct/ \
                                   --image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
                                   --input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
                                   --max_new_tokens 50 \
                                   --batch_size 1

Output:

", it would be:.\\nA rabbit in a coat.\\nA charming and dapper fellow.\\nHe's a stylish chap indeed. <OCR/> ርርርርርር

Works as expected.

Actual behavior

When run with:

python3 tools/multimodal/client.py --model_type mllama \
                                   --text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
                                   --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
                                   --top-p 0.7 --temperature 0.9 --top-k 40 --request-output-len 20

The result is:

[beam 0 ]:
<|image|><|begin_of_text|>If I had to write a haiku for this one, it would be:
“Golden sunsets fade
Gone, yet memories remain
Summer's

When shown a different image or given different runtime parameters, the output similarly ignores the image content (a different image produces the same output). A quick way to demonstrate this is sketched below.
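A rough check, assuming the same client.py invocation as above; the second image URL is just a placeholder for any other visually distinct image:

import subprocess

# Rerun the exact request from above with two different images and compare
# the generated text. Identical output across distinct images reproduces
# the "image is ignored" symptom.
images = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg",
    "https://example.com/some-other-image.jpg",  # placeholder
]

outputs = []
for url in images:
    result = subprocess.run(
        [
            "python3", "tools/multimodal/client.py",
            "--model_type", "mllama",
            "--text", "<|image|><|begin_of_text|>If I had to write a haiku for this one",
            "--image", url,
            "--top-p", "0.7", "--temperature", "0.9", "--top-k", "40",
            "--request-output-len", "20",
        ],
        capture_output=True, text=True,
    )
    outputs.append(result.stdout)

print("outputs identical:", outputs[0] == outputs[1])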

Additional notes

I double-checked the tokenization output and confirmed that the image inputs are sent correctly (image_bytes). I also verified that encoder_input_features and cross_attention_masks (the multimodal_encoder outputs) are in the same ballpark (though by no means identical) as when run with tensorrt_llm/examples/multimodal/run.py.

encoder_input_features in Triton:

tensor([[[  8.1875,  12.3750,  -4.5938,  ..., -12.1875,  -4.4062,   5.1250],
         [ -1.0625,  13.5000,   7.4375,  ...,  -2.3125,  -3.0625, -13.2500],
         [-12.5000,   7.0625,   8.5625,  ...,   3.1875,  -0.1836,  -8.4375],
         ...,
         [ -3.8906,  -2.5625,  -6.0938,  ...,  -2.2812,  -8.1875,  -3.0312],
         [  2.7031,   7.0938,  -7.6875,  ...,  -8.5625,  -4.4062, -22.2500],
         [  4.2500,   1.2734,   1.5156,  ...,  -1.8359,  -2.5312,   1.5625]]],
       device='cuda:0', dtype=torch.bfloat16)

In the TensorRT-LLM runner:

tensor([[  8.1250,  12.3750,  -4.6875,  ..., -12.1875,  -4.3438,   5.2500],
        [ -1.1328,  13.3125,   7.4062,  ...,  -2.6250,  -2.9531, -13.1875],
        [-12.3750,   6.9688,   8.5625,  ...,   2.9688,  -0.2139,  -8.6250],
        ...,
        [ -5.4375,  -2.8125,  -6.9375,  ...,  -3.4375,  -7.8125,  -3.7969],
        [  1.1641,   6.9062,  -3.5000,  ...,  -3.0625,  -2.9688, -27.2500],
        [  4.6562,   1.3906,   1.6953,  ...,  -1.6484,  -2.9375,   1.3281]],
       device='cuda:0', dtype=torch.bfloat16)
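To put a number on that difference, a minimal comparison sketch, assuming both tensors are dumped with torch.save (from the Triton python backend and from run.py respectively); the file names are placeholders:

import torch

# Placeholder dump files: the Triton tensor is [1, N, D], the runner tensor is [N, D].
triton_feats = torch.load("encoder_input_features_triton.pt").float().squeeze(0)
runner_feats = torch.load("encoder_input_features_runner.pt").float()

abs_diff = (triton_feats - runner_feats).abs()
print("max abs diff: ", abs_diff.max().item())
print("mean abs diff:", abs_diff.mean().item())

# Per-row cosine similarity: bf16 rounding noise should stay very close to 1.0,
# while a genuine divergence in the encoder inputs or preprocessing should not.
cos = torch.nn.functional.cosine_similarity(triton_feats, runner_feats, dim=-1)
print("min row cosine similarity:", cos.min().item())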

If that difference is not acceptable, should I look into it? Aside from that, the BLS setup is also not working. The LLM itself seems to be working fine and giving correct responses.

mutkach avatar Feb 05 '25 12:02 mutkach