text-generation-inference
Idefics tokenizer incorrectly accounts for image context
System Info
v1.4.5
Information
- [X] Docker
- [X] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Deploy IDEFICS 9b or 80b
- Query the model with a [large] base64-encoded image:
"User: What is in this image?<end_of_utterance>\nAssistant:"
Error:
2024-03-21T21:00:20.970845Z ERROR compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-a100-sxm4-80gb"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: Some(1.0), frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(256), return_full_text: Some(false), stop: ["<end_of_utterance>", "\nUser:"], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:generate:generate_stream: text_generation_router::infer: router/src/infer.rs:123: `inputs` tokens + `max_new_tokens` must be <= 2048. Given: 19649 `inputs` tokens and 256 `max_new_tokens`
Expected behavior
The tokenizer should be aware that the <base64-encoded-image-string> part of the prompt represents the image, and those characters should not be counted as regular text tokens toward the model's input token limit.
This is noted in the tests here: https://github.com/huggingface/text-generation-inference/blob/4ee0a0c4010b6e000f176977648aa1749339e8cb/integration-tests/models/test_idefics.py#L19
I'm encountering the same problem and didn't get it to work with the new 2.0 TGI version. Do we have documentation somewhere on how to correctly send images to endpoints for vision LLMs? @andrewrreed @Narsil
Reproduction for llava-v1.6:
from huggingface_hub import create_inference_endpoint
repository = "llava-hf/llava-v1.6-mistral-7b-hf"
endpoint_name = "llava-v1-6-mistral-7b-hf-02"
namespace = "MoritzLaurer"
endpoint = create_inference_endpoint(
    endpoint_name,
    repository=repository,
    namespace=namespace,
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",
    instance_type="g5.2xlarge",
    min_replica=0,
    max_replica=1,
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_INPUT_LENGTH": "1024",
            "MAX_TOTAL_TOKENS": "1512",
            "MODEL_ID": "/repository"
        },
        "url": "ghcr.io/huggingface/text-generation-inference:2.0",
    },
)
endpoint.wait()
print(endpoint.status)
from huggingface_hub import HfApi
api = HfApi()
endpoint = api.get_inference_endpoint(name=endpoint_name, namespace=namespace)
print(endpoint)
print(endpoint.raw)
from PIL import Image
import requests
import base64
from io import BytesIO
# fetch image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode()
prompt = f"[INST] {base64_image}\nWhat is shown in this image? [/INST]"
# also tried:
#prompt = f"[INST] \nWhat is shown in this image? [/INST]"
import requests
import huggingface_hub

API_URL = endpoint.url
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {huggingface_hub.get_token()}",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": prompt
    #"parameters": {
    #}
})
output
{'error': 'Input validation error: `inputs` tokens + `max_new_tokens` must be <= 1512. Given: 100946 `inputs` tokens and 100 `max_new_tokens`',
'error_type': 'validation'}
Edit: Also tried this from the idefics integration test, but doesn't seem to work:
# fetch image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
image_string = f"data:image/jpeg;base64,{base64_image}"
prompt = f"[INST] \nWhat is shown in this image? [/INST]"
this makes the string shorter, but still fails
{'error': 'Input validation error: `inputs` tokens + `max_new_tokens` must be <= 1512. Given: 2750 `inputs` tokens and 100 `max_new_tokens`',
'error_type': 'validation'}
@edwardzjl, yeah, I also tried this yesterday but then got the max token error with 2750 input tokens mentioned above. I have now found out that llava-1.6 actually converts an image into 2000+ tokens. This probably means that sending the image was actually working, but the endpoint I had created only allowed 2048 max tokens, so the extra image tokens triggered the error. An important thing to keep in mind when working with llava-1.6: increase the max tokens, e.g. to 2048*2.
image_string = f"data:image/jpeg;base64,{base64_image}"
prompt = f"[INST] \nWhat is shown in this image? [/INST]"
Weird, when I now create an HF endpoint with a higher max token limit, I get another error.
from huggingface_hub import create_inference_endpoint
repository = "llava-hf/llava-v1.6-mistral-7b-hf"
endpoint_name = "llava-v1-6-mistral-7b-hf-002"
namespace = "MoritzLaurer"
endpoint = create_inference_endpoint(
    endpoint_name,
    repository=repository,
    namespace=namespace,
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",
    instance_type="g5.2xlarge",
    min_replica=0,
    max_replica=1,
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_INPUT_LENGTH": "3072",
            "MAX_TOTAL_TOKENS": "3584",
            "MODEL_ID": "/repository"
        },
        "url": "ghcr.io/huggingface/text-generation-inference:2.0",
    },
)
endpoint.wait()
print(endpoint.status)
from PIL import Image
import requests
import base64
from io import BytesIO
# fetch image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
# format image string
image_string = f"data:image/jpeg;base64,{base64_image}"
prompt = f"[INST] \nWhat is shown in this image? [/INST]"
import requests
import huggingface_hub

API_URL = endpoint.url
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {huggingface_hub.get_token()}",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": prompt
    #"parameters": {
    #}
})
output
{'error': 'Request failed during generation: Server error: CANCELLED',
'error_type': 'generation'}
And these are the HF endpoint logs:
2024/04/16 10:16:45 ~ {"timestamp":"2024-04-16T08:16:45.465586Z","level":"INFO","fields":{"message":"Args { model_id: \"/repository\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(3072), max_total_tokens: Some(3584), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(4096), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"r-moritzlaurer-llava-v1-6-mistral-7b-hf-02-2yvd0n5t-4ad14-c44e8\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }"},"target":"text_generation_launcher"}
2024/04/16 10:16:45 ~ {"timestamp":"2024-04-16T08:16:45.465648Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
2024/04/16 10:16:45 ~ {"timestamp":"2024-04-16T08:16:45.465729Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/16 10:16:49 ~ {"timestamp":"2024-04-16T08:16:49.024709Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
2024/04/16 10:16:49 ~ {"timestamp":"2024-04-16T08:16:49.641304Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/16 10:16:49 ~ {"timestamp":"2024-04-16T08:16:49.641543Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.377556Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0\n"},"target":"text_generation_launcher"}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.451718Z","level":"INFO","fields":{"message":"Shard ready in 9.809275402s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.548618Z","level":"INFO","fields":{"message":"Starting Webserver"},"target":"text_generation_launcher"}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652482Z","level":"WARN","message":"Warning: Token '<image>' was expected to have ID '32000' but was given ID 'None'","log.target":"tokenizers::tokenizer::serialization","log.module_path":"tokenizers::tokenizer::serialization","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","log.line":159,"target":"tokenizers::tokenizer::serialization","filename":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","line_number":159}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652522Z","level":"WARN","message":"Warning: Token '<pad>' was expected to have ID '32001' but was given ID 'None'","log.target":"tokenizers::tokenizer::serialization","log.module_path":"tokenizers::tokenizer::serialization","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","log.line":159,"target":"tokenizers::tokenizer::serialization","filename":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","line_number":159}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652861Z","level":"INFO","message":"Using config Some(LlavaNext(LlavaNext { text_config: TextConfig, vision_config: VisionConfig { image_size: 336, patch_size: 14 }, image_grid_pinpoints: [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)] }))","target":"text_generation_router","filename":"router/src/main.rs","line_number":250}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652887Z","level":"INFO","message":"Using local tokenizer config","target":"text_generation_router","filename":"router/src/main.rs","line_number":257}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652933Z","level":"WARN","message":"no pipeline tag found for model /repository","target":"text_generation_router","filename":"router/src/main.rs","line_number":292}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.656858Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":311}
2024/04/16 10:17:01 ~ {"timestamp":"2024-04-16T08:17:01.075402Z","level":"INFO","fields":{"message":"Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]\n"},"target":"text_generation_launcher"}
2024/04/16 10:17:02 ~ {"timestamp":"2024-04-16T08:17:02.110301Z","level":"INFO","message":"Setting max batch total tokens to 491968","target":"text_generation_router","filename":"router/src/main.rs","line_number":348}
2024/04/16 10:17:02 ~ {"timestamp":"2024-04-16T08:17:02.110318Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":349}
2024/04/16 10:17:02 ~ {"timestamp":"2024-04-16T08:17:02.110322Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":363}
It seems that the current implementation counts the tokens generated from the encoded image as part of the prompt length. It might be better to extract the image features first and then calculate the prompt token length separately. I'm not sure if TGI has support for this approach, as it could be quite involved.
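A rough illustration of the effect: tokenizing the raw base64 string as ordinary text already produces tens of thousands of tokens, which matches the ~100k-token validation error above. A small sketch, assuming transformers is installed locally (the exact count depends on the image):

import base64
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoTokenizer

# fetch the same radar-plot image used above and base64-encode it
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
buffer = BytesIO()
image.save(buffer, format="JPEG")
base64_image = base64.b64encode(buffer.getvalue()).decode("utf-8")

# count how many text tokens the base64 string alone turns into
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
print(len(tokenizer(base64_image)["input_ids"]))  # typically tens of thousands of tokens for an image of this size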
It now works for me with the latest version of TGI and with idefics2, which was recently added (https://github.com/huggingface/text-generation-inference/pull/1756).
I think we can close the issue @andrewrreed @Narsil.
In case someone needs reusable code for deploying and using idefics2 via the HF endpoints:
Create endpoint with latest TGI version:
from huggingface_hub import create_inference_endpoint
repository = "HuggingFaceM4/idefics2-8b"
endpoint_name = "idefics2-8b-00"
namespace = "MoritzLaurer"
endpoint = create_inference_endpoint(
    endpoint_name,
    repository=repository,
    namespace=namespace,
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="xlarge",  # "medium",
    instance_type="p4de",  # "g5.2xlarge",
    min_replica=0,
    max_replica=1,
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_INPUT_LENGTH": "3072",
            "MAX_TOTAL_TOKENS": "3584",
            "MODEL_ID": "/repository"
        },
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
    },
)
endpoint.wait()
print(endpoint.status)
Correctly apply the idefics2 chat template and preprocess images either as URL or as encoded strings:
import torch
from PIL import Image
import requests
import base64
from io import BytesIO
from transformers import AutoProcessor

# create the prompt and format it with the chat template
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

messages = [
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What is the difference between the two images?"},
        ]
    },
]

prompt_with_template = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt_with_template, "\n")
# format image input correctly
# you can either provide a url, or the path to local image files
images_from_url = True

if images_from_url:
    image_strings = [
        "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
        "https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg"
    ]
else:
    # load images from local directory
    image_path_lst = ["./data/image1.jpeg", "./data/image2.jpeg"]

    # encode images to strings which can be sent to the endpoint
    def encode_local_image(image_path):
        # load image
        image = Image.open(image_path)
        # Convert the image to a base64 string
        buffer = BytesIO()
        image.save(buffer, format="JPEG")  # Use the appropriate format (e.g., JPEG, PNG)
        base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
        # add string formatting required by the endpoint
        image_string = f"data:image/jpeg;base64,{base64_image}"
        return image_string

    image_strings = []
    for image_path in image_path_lst:
        image_strings.append(encode_local_image(image_path))
# insert the image url or encoded strings into the formatted prompt
def insert_image_strings_in_prompt(prompt_with_template, image_strings):
    # Create an iterator from the replacements list
    image_string_iter = iter(image_strings)
    # Replace each <image> placeholder with markdown image syntax and fill in the next image string
    prompt_with_template_and_images = prompt_with_template.replace("<image>", "![]({})").format(*image_string_iter)
    return prompt_with_template_and_images
prompt_with_images = insert_image_strings_in_prompt(prompt_with_template, image_strings)
print(prompt_with_images)
API call to active endpoint:
import huggingface_hub

API_URL = endpoint.url
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {huggingface_hub.get_token()}",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": prompt_with_images,
    "parameters": {
        "return_full_text": False,
    }
})
output
# [{'generated_text': ' The first image shows the Statue of Liberty in New York City while the second image shows the Golden Gate Bridge in San Francisco.'}]
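For convenience, the same request can also be sent through huggingface_hub's InferenceClient instead of building the POST call manually (a sketch, reusing the endpoint and prompt_with_images from above):

from huggingface_hub import InferenceClient

# point the client at the deployed endpoint and authenticate with the local HF token
client = InferenceClient(model=endpoint.url, token=huggingface_hub.get_token())
generated_text = client.text_generation(prompt_with_images, max_new_tokens=256)
print(generated_text)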