Image eats up way too many tokens
System Info
Using Inference Endpoint here: https://endpoints.huggingface.co/m-ric/endpoints/qwen2-72b-instruct-psj ghcr.io/huggingface/text-generation-inference:3.0.1
Information
- [ ] Docker
- [x] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Here's what I'm trying to run:
import base64
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    base_url="https://lmqbs8965pj40e01.us-east-1.aws.endpoints.huggingface.cloud/v1",
    api_key=os.getenv("HF_TOKEN"),
)

with open('./screenshot.png', 'rb') as img_file:
    base64_image = base64.b64encode(img_file.read()).decode('utf-8')

client.chat.completions.create(
    model="a",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's on this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
            ],
        }
    ],
)
The image is not big; here it is:
I get this error:
huggingface_hub.errors.HfHubHTTPError: 422 Client Error: Unprocessable Entity for url: https://lmqbs8965pj40e01.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions (Request ID: 9kQ8on)
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 32768. Given: 96721 `inputs` tokens and 0 `max_new_tokens`
It seems like my image was converted into a very large number of tokens, even though the original is only roughly 1000×1000 pixels.
Expected behavior
I'd expect the uploaded image to be <1k tokens instead of ~100k tokens.
Other APIs (OpenAI, Anthropic) handle the same image fine, so I'm wondering: do they do some image size reduction pre-processing? Or is this a bug on TGI side?
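For reference, a rough client-side workaround (and possibly what such pre-processing looks like, though that part is just my assumption) is to downscale and re-encode the image before building the data URL. This is only a Pillow sketch; the 1024 px cap, the JPEG re-encoding, and the helper name are arbitrary choices of mine, not anything TGI requires:

```python
import base64
import io

from PIL import Image


def image_to_data_url(path: str, max_side: int = 1024) -> str:
    """Downscale an image so its longest side is <= max_side and return a data URL."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))       # in-place, keeps aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)  # JPEG keeps the payload much smaller than PNG
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"


# Drop-in replacement for the data URL built in the reproduction above
data_url = image_to_data_url("./screenshot.png")
```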
I am also facing a similar issue, and it looks to me like TGI's validation logic counts tokens incorrectly for inline images: https://github.com/huggingface/text-generation-inference/blob/main/integration-tests/conftest.py#L668
Please let me know if there is any workaround for this. The images I am passing lead to 100k+ tokens.
Any updates on this issue? It is happening on 3.2.0 too.
Some more hints on this issue (tested with Qwen 2.5 VL 32B):
- In the same scenario where images require too many tokens, I found that if I provide the image through an http/https URL rather than encoding it in the request, inference works.
- If the URL does not exist, inference does not fail; it continues, and since the image could not be read, the model hallucinates something.
- If the URL involves a redirect (like images posted on a Hugging Face dataset and accessed through a link), it fails.
Extra hint:
- If the URL has query parameters, like an AWS signed URL, it does not work either. (A quick diagnostic for both URL failure modes is sketched below.)
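Here is a small diagnostic sketch for those two failure modes (it uses the `requests` package; these are heuristics of mine, not TGI's actual validation logic):

```python
from urllib.parse import urlparse

import requests


def check_image_url(url: str) -> None:
    """Heuristic checks for the URL issues reported above."""
    if urlparse(url).query:
        print("URL has query parameters (e.g. a signed URL) - reported to fail")
    resp = requests.head(url, allow_redirects=False, timeout=10)
    if resp.is_redirect:
        print(f"URL redirects to {resp.headers.get('Location')} - reported to fail")
    elif resp.status_code >= 400:
        print(f"URL returned {resp.status_code} - the model may silently hallucinate")
    else:
        print("URL looks directly fetchable")


check_image_url("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg")
```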
For me, signed AWS S3 URLs work; however, it is still failing for many images due to the token count. Any solution to this, please?
I don't understand how this is a problem for TGI to solve, when the real issue is that your source image is too large. Maybe consider tiling (a minimal Pillow sketch follows this list) or using proper tools or libraries, such as:
- OpenCV / PIL / scikit-image (Image resizing, tiling, format conversion)
- timm (Access to high-performance ViTs, Swin, ConvNeXt with large image support)
- Segment Anything (For large image segmentation with built-in tiling logic)
- Detectron2 / MMDetection (For object detection and segmentation with large image pipelines)
- MONAI (Medical imaging toolkit with sliding window inference)
- Patchify / unpatchify (Simple patch extraction and reconstruction utilities)
- Rasterio (For geospatial large image processing)
- Longformer / BigBird / Perceiver IO (For sparse attention on large token sets)
- xFormers (Modular attention blocks supporting efficient transformers)
- ONNX Runtime / TensorRT (Accelerated inference with batch or tile-based input support)
- TorchScript / TorchDynamo / IPEX (Optimize PyTorch models for inference)
- DeepSpeed / Hugging Face Accelerate (Scale transformer-based vision models with low memory overhead)
- OpenCLIP / CLIP-as-service (Extract image embeddings for lightweight downstream tasks)
- LaViLa (For region-aware inference on large images)
And the list goes on, and on...
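For what it's worth, the simplest form of that tiling suggestion looks roughly like this with Pillow (a sketch only; the 512 px tile size is an arbitrary choice, and this does not change how TGI itself counts tokens):

```python
from PIL import Image


def tile_image(path: str, tile: int = 512):
    """Yield non-overlapping tiles of at most tile x tile pixels."""
    img = Image.open(path)
    w, h = img.size
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            yield img.crop((left, top, min(left + tile, w), min(top + tile, h)))


tiles = list(tile_image("./screenshot.png"))
print(f"Split into {len(tiles)} tiles")
```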
As mentioned before, there are other open issues similar to this one: there is a problem with how the number of tokens is calculated.
Sometimes, for the same image, passing it as a URL works, while passing it as base64 fails because it exceeds the max token length.
This PR, merged last week, fixed the problem for me: https://github.com/huggingface/text-generation-inference/pull/3157
Now that it is in the main branch, try building TGI from main and see if it fixes the problem for you.
Hi, I still see the same issue (HF Magma model, hosted in Azure AI Foundry).
The validator wants (input image + prompt + max_new_tokens) < 4096. If I pass a base64-encoded image, it looks like each byte is counted as one token, i.e. the payload is treated as a string and the token limit is validated before the image is decoded. Consequently, I can only send an image smaller than ~3 KB (leaving ~1 KB for text in and out). There must be another way to pass an image (besides a public URL), right?
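One way to sanity-check that hypothesis (my assumption, not confirmed endpoint behaviour) is to compare the base64 payload length with the token count reported in the 422 error:

```python
import base64

# Hypothetical local image path; substitute the image you are actually sending
with open("./screenshot.png", "rb") as f:
    payload = base64.b64encode(f.read()).decode("utf-8")

# If the reported `inputs` token count is close to this number, the image is
# being counted as plain text before it is decoded.
print(f"base64 characters: {len(payload)}")
```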
For context, I have also loaded the model directly from the GH repo and run inference with a base64 image + text, and that works with no issue. But in that case, I decode the base64 into an image before passing it to the model.
I am experiencing the same problem
After updating TGI to 3.3.0, the error indeed went away! Full code to reproduce and check that the token count is correct, using a Qwen2.5-VL-72B endpoint:
import os
from dotenv import load_dotenv
from openai import OpenAI
from qwen_vl_utils import process_vision_info
from transformers.models.auto.processing_auto import AutoImageProcessor

load_dotenv(override=True)

client = OpenAI(
    base_url="https://n5wr7lfx6wp94tvl.us-east-1.aws.endpoints.huggingface.cloud/v1",
    api_key=os.getenv("HF_TOKEN"),
)

processor = AutoImageProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
# text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    images=image_inputs,
    videos=video_inputs,
)

# pixel_values has one row per vision patch; use it as a rough proxy for the image token count
token_count = inputs["pixel_values"].shape[0]
if token_count > 32000:
    raise ValueError(f"Token count ({token_count}) exceeds the 32k limit for Qwen model")
else:
    print(f"All is right on the token count: {token_count} tokens")

output = client.chat.completions.create(
    model="a",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's on this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
                },
            ],
        }
    ],
)
print(output)