Image eats up way too many tokens
System Info
Using Inference Endpoint here: https://endpoints.huggingface.co/m-ric/endpoints/qwen2-72b-instruct-psj ghcr.io/huggingface/text-generation-inference:3.0.1
Information
- [ ] Docker
- [x] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Here's what I'm trying to run:
import base64
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    base_url="https://lmqbs8965pj40e01.us-east-1.aws.endpoints.huggingface.cloud/v1",
    api_key=os.getenv("HF_TOKEN"),
)

with open('./screenshot.png', 'rb') as img_file:
    base64_image = base64.b64encode(img_file.read()).decode('utf-8')

client.chat.completions.create(
    model="a",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's on this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
            ],
        }
    ],
)
The image is not big; here it is:
I get this error:
huggingface_hub.errors.HfHubHTTPError: 422 Client Error: Unprocessable Entity for url: https://lmqbs8965pj40e01.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions (Request ID: 9kQ8on)
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 32768. Given: 96721 `inputs` tokens and 0 `max_new_tokens`
It seems like my image was converted into a very large number of tokens, even though the original is only roughly 1000×1000 pixels.
Expected behavior
I'd expect the uploaded image to be <1k tokens instead of ~100k tokens.
Other APIs (OpenAI, Anthropic) handle the same image fine, so I'm wondering: do they do some image size reduction pre-processing? Or is this a bug on TGI side?
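For reference, a rough client-side workaround (and possibly what such pre-processing looks like, though that part is just my assumption) is to downscale and re-encode the image before building the data URL. This is only a Pillow sketch; the 1024 px cap, the JPEG re-encoding, and the helper name are arbitrary choices of mine, not anything TGI requires:

```python
import base64
import io

from PIL import Image


def image_to_data_url(path: str, max_side: int = 1024) -> str:
    """Downscale an image so its longest side is <= max_side and return a data URL."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))       # in-place, keeps aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)  # JPEG keeps the payload much smaller than PNG
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"


# Drop-in replacement for the data URL built in the reproduction above
data_url = image_to_data_url("./screenshot.png")
```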
I am also facing a similar issue, and it looks to me like TGI's validation logic counts tokens incorrectly for inline images: https://github.com/huggingface/text-generation-inference/blob/main/integration-tests/conftest.py#L668
Please let me know if there is any workaround for this. The images I am passing lead to 100k+ tokens.
Any updates on this issue? It is happening on 3.2.0 too.
Some more hints on this issue (tested with Qwen 2.5 VL 32B):
- In the same scenario where images require too many tokens, I found that if I provide the image through an http/https URL rather than encoding it in the request, inference works.
- If the URL does not exist, inference does not fail; it continues, and since the image could not be read, the model hallucinates something.
- If the URL involves a redirect (like images posted on a Hugging Face dataset and accessed through a link), it fails.
Extra hint:
- If the URL has query parameters, like an AWS signed URL, it does not work either. (A quick diagnostic for both URL failure modes is sketched below.)
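Here is a small diagnostic sketch for those two failure modes (it uses the `requests` package; these are heuristics of mine, not TGI's actual validation logic):

```python
from urllib.parse import urlparse

import requests


def check_image_url(url: str) -> None:
    """Heuristic checks for the URL issues reported above."""
    if urlparse(url).query:
        print("URL has query parameters (e.g. a signed URL) - reported to fail")
    resp = requests.head(url, allow_redirects=False, timeout=10)
    if resp.is_redirect:
        print(f"URL redirects to {resp.headers.get('Location')} - reported to fail")
    elif resp.status_code >= 400:
        print(f"URL returned {resp.status_code} - the model may silently hallucinate")
    else:
        print("URL looks directly fetchable")


check_image_url("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg")
```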
For me, signed AWS S3 URLs work; however, it is still failing for many images due to the token count. Any solution to this, please?
I don't understand how this is a problem for TGI to solve, when the real issue is that your source image is too large. Maybe consider tiling (a minimal Pillow sketch follows this list) or using proper tools or libraries, such as:
- OpenCV / PIL / scikit-image (Image resizing, tiling, format conversion)
- timm (Access to high-performance ViTs, Swin, ConvNeXt with large image support)
- Segment Anything (For large image segmentation with built-in tiling logic)
- Detectron2 / MMDetection (For object detection and segmentation with large image pipelines)
- MONAI (Medical imaging toolkit with sliding window inference)
- Patchify / unpatchify (Simple patch extraction and reconstruction utilities)
- Rasterio (For geospatial large image processing)
- Longformer / BigBird / Perceiver IO (For sparse attention on large token sets)
- xFormers (Modular attention blocks supporting efficient transformers)
- ONNX Runtime / TensorRT (Accelerated inference with batch or tile-based input support)
- TorchScript / TorchDynamo / IPEX (Optimize PyTorch models for inference)
- DeepSpeed / Hugging Face Accelerate (Scale transformer-based vision models with low memory overhead)
- OpenCLIP / CLIP-as-service (Extract image embeddings for lightweight downstream tasks)
- LaViLa (For region-aware inference on large images)
And the list goes on, and on...
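For what it's worth, the simplest form of that tiling suggestion looks roughly like this with Pillow (a sketch only; the 512 px tile size is an arbitrary choice, and this does not change how TGI itself counts tokens):

```python
from PIL import Image


def tile_image(path: str, tile: int = 512):
    """Yield non-overlapping tiles of at most tile x tile pixels."""
    img = Image.open(path)
    w, h = img.size
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            yield img.crop((left, top, min(left + tile, w), min(top + tile, h)))


tiles = list(tile_image("./screenshot.png"))
print(f"Split into {len(tiles)} tiles")
```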
As mentioned before, there are other open issues similar to this one: there is a problem with how the number of tokens is calculated.
Sometimes, for the same image, passing it as a URL works, while passing it as base64 fails because it exceeds the max token length.
This PR, merged last week, fixed the problem for me: https://github.com/huggingface/text-generation-inference/pull/3157
Now that it is in the main branch, try building TGI from main and see if it fixes the problem for you.
Hi, I still see the same issue (HF Magma model, hosted in Azure AI Foundry).
The validator wants (input image + prompt + max_new_tokens) < 4096. If I pass a base64-encoded image, it looks like each byte is counted as one token, i.e. the payload is treated as a string and the token limit is validated before the image is decoded. Consequently, I can only send an image smaller than ~3 KB (leaving ~1 KB for text in and out). There must be another way to pass an image (besides a public URL), right?
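One way to sanity-check that hypothesis (my assumption, not confirmed endpoint behaviour) is to compare the base64 payload length with the token count reported in the 422 error:

```python
import base64

# Hypothetical local image path; substitute the image you are actually sending
with open("./screenshot.png", "rb") as f:
    payload = base64.b64encode(f.read()).decode("utf-8")

# If the reported `inputs` token count is close to this number, the image is
# being counted as plain text before it is decoded.
print(f"base64 characters: {len(payload)}")
```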
For context, I have also loaded the model directly from the GH repo and run inference with a base64 image + text, and that works with no issue. But in that case, I decode the base64 into an image before passing it to the model.
I am experiencing the same problem
After updating TGI to 3.3.0, the error indeed went away! Full code to reproduce and check that the token count is correct, using a Qwen2.5-VL-72B endpoint:
import os
from dotenv import load_dotenv
from openai import OpenAI
from qwen_vl_utils import process_vision_info
from transformers.models.auto.processing_auto import AutoImageProcessor

load_dotenv(override=True)

client = OpenAI(
    base_url="https://n5wr7lfx6wp94tvl.us-east-1.aws.endpoints.huggingface.cloud/v1",
    api_key=os.getenv("HF_TOKEN"),
)

processor = AutoImageProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
# text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    images=image_inputs,
    videos=video_inputs,
)

# pixel_values has one row per vision patch; use it as a rough proxy for the image token count
token_count = inputs["pixel_values"].shape[0]
if token_count > 32000:
    raise ValueError(f"Token count ({token_count}) exceeds the 32k limit for Qwen model")
else:
    print(f"All is right on the token count: {token_count} tokens")

output = client.chat.completions.create(
    model="a",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's on this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
                },
            ],
        }
    ],
)
print(output)