text-generation-inference
Idefics tokenizer incorrectly accounts for image context
System Info
v1.4.5
Information
- [X] Docker
- [X] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Deploy IDEFICS 9b or 80b
- Query the model with a [large] base64-encoded image:
"User: What is in this image?<end_of_utterance>\nAssistant:"
Error:
2024-03-21T21:00:20.970845Z ERROR compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-a100-sxm4-80gb"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: Some(1.0), frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(256), return_full_text: Some(false), stop: ["<end_of_utterance>", "\nUser:"], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:generate:generate_stream: text_generation_router::infer: router/src/infer.rs:123: `inputs` tokens + `max_new_tokens` must be <= 2048. Given: 19649 `inputs` tokens and 256 `max_new_tokens`
Expected behavior
The tokenizer should be aware that the <base64-encoded-image-string> part of the prompt represents the image, and those characters should not be counted as regular text tokens toward the model's input token limit.
This is noted in the tests here: https://github.com/huggingface/text-generation-inference/blob/4ee0a0c4010b6e000f176977648aa1749339e8cb/integration-tests/models/test_idefics.py#L19
I'm encountering the same problem and didn't get it to work with the new 2.0 TGI version. Do we have documentation somewhere on how to correctly send images to endpoints for vision LLMs? @andrewrreed @Narsil
Reproduction for llava-v1.6:
from huggingface_hub import create_inference_endpoint
repository = "llava-hf/llava-v1.6-mistral-7b-hf"
endpoint_name = "llava-v1-6-mistral-7b-hf-02"
namespace = "MoritzLaurer"
endpoint = create_inference_endpoint(
    endpoint_name,
    repository=repository,
    namespace=namespace,
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",
    instance_type="g5.2xlarge",
    min_replica=0,
    max_replica=1,
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_INPUT_LENGTH": "1024",
            "MAX_TOTAL_TOKENS": "1512",
            "MODEL_ID": "/repository"
        },
        "url": "ghcr.io/huggingface/text-generation-inference:2.0",
    },
)
endpoint.wait()
print(endpoint.status)
from huggingface_hub import HfApi
api = HfApi()
endpoint = api.get_inference_endpoint(name=endpoint_name, namespace=namespace)
print(endpoint)
print(endpoint.raw)
from PIL import Image
import requests
import base64
from io import BytesIO
# fetch image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode()
prompt = f"[INST] {base64_image}\nWhat is shown in this image? [/INST]"
# also tried:
#prompt = f"[INST] \nWhat is shown in this image? [/INST]"
import requests
import huggingface_hub

API_URL = endpoint.url
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {huggingface_hub.get_token()}",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": prompt
    #"parameters": {
    #}
})
output
{'error': 'Input validation error: `inputs` tokens + `max_new_tokens` must be <= 1512. Given: 100946 `inputs` tokens and 100 `max_new_tokens`',
'error_type': 'validation'}
Edit: Also tried this from the idefics integration test, but doesn't seem to work:
# fetch image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
image_string = f"data:image/jpeg;base64,{base64_image}"
prompt = f"[INST] \nWhat is shown in this image? [/INST]"
this makes the string shorter, but still fails
{'error': 'Input validation error: `inputs` tokens + `max_new_tokens` must be <= 1512. Given: 2750 `inputs` tokens and 100 `max_new_tokens`',
'error_type': 'validation'}
@edwardzjl, yeah, I also tried this yesterday but then got the max token error with 2750 input tokens mentioned above. I have now found out that llava-1.6 actually converts an image into 2000+ tokens. This probably means that sending the image was actually working, but the endpoint I had created only allowed 2048 max tokens, so the extra image tokens triggered the error. An important thing to keep in mind when working with llava-1.6: increase the max tokens, e.g. to 2048*2.
image_string = f"data:image/jpeg;base64,{base64_image}"
prompt = f"[INST] \nWhat is shown in this image? [/INST]"
Weird, when I now create an HF endpoint with a higher max token limit, I get another error.
from huggingface_hub import create_inference_endpoint
repository = "llava-hf/llava-v1.6-mistral-7b-hf"
endpoint_name = "llava-v1-6-mistral-7b-hf-002"
namespace = "MoritzLaurer"
endpoint = create_inference_endpoint(
    endpoint_name,
    repository=repository,
    namespace=namespace,
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",
    instance_type="g5.2xlarge",
    min_replica=0,
    max_replica=1,
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_INPUT_LENGTH": "3072",
            "MAX_TOTAL_TOKENS": "3584",
            "MODEL_ID": "/repository"
        },
        "url": "ghcr.io/huggingface/text-generation-inference:2.0",
    },
)
endpoint.wait()
print(endpoint.status)
from PIL import Image
import requests
import base64
from io import BytesIO
# fetch image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
# format image string
image_string = f"data:image/jpeg;base64,{base64_image}"
prompt = f"[INST] \nWhat is shown in this image? [/INST]"
import requests
import huggingface_hub

API_URL = endpoint.url
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {huggingface_hub.get_token()}",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": prompt
    #"parameters": {
    #}
})
output
{'error': 'Request failed during generation: Server error: CANCELLED',
'error_type': 'generation'}
And these are the HF endpoint logs:
2024/04/16 10:16:45 ~ {"timestamp":"2024-04-16T08:16:45.465586Z","level":"INFO","fields":{"message":"Args { model_id: \"/repository\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(3072), max_total_tokens: Some(3584), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(4096), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"r-moritzlaurer-llava-v1-6-mistral-7b-hf-02-2yvd0n5t-4ad14-c44e8\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }"},"target":"text_generation_launcher"}
2024/04/16 10:16:45 ~ {"timestamp":"2024-04-16T08:16:45.465648Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
2024/04/16 10:16:45 ~ {"timestamp":"2024-04-16T08:16:45.465729Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/16 10:16:49 ~ {"timestamp":"2024-04-16T08:16:49.024709Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
2024/04/16 10:16:49 ~ {"timestamp":"2024-04-16T08:16:49.641304Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/16 10:16:49 ~ {"timestamp":"2024-04-16T08:16:49.641543Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.377556Z","level":"INFO","fields":{"message":"Server started at unix:///tmp/text-generation-server-0\n"},"target":"text_generation_launcher"}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.451718Z","level":"INFO","fields":{"message":"Shard ready in 9.809275402s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.548618Z","level":"INFO","fields":{"message":"Starting Webserver"},"target":"text_generation_launcher"}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652482Z","level":"WARN","message":"Warning: Token '<image>' was expected to have ID '32000' but was given ID 'None'","log.target":"tokenizers::tokenizer::serialization","log.module_path":"tokenizers::tokenizer::serialization","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","log.line":159,"target":"tokenizers::tokenizer::serialization","filename":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","line_number":159}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652522Z","level":"WARN","message":"Warning: Token '<pad>' was expected to have ID '32001' but was given ID 'None'","log.target":"tokenizers::tokenizer::serialization","log.module_path":"tokenizers::tokenizer::serialization","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","log.line":159,"target":"tokenizers::tokenizer::serialization","filename":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs","line_number":159}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652861Z","level":"INFO","message":"Using config Some(LlavaNext(LlavaNext { text_config: TextConfig, vision_config: VisionConfig { image_size: 336, patch_size: 14 }, image_grid_pinpoints: [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)] }))","target":"text_generation_router","filename":"router/src/main.rs","line_number":250}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652887Z","level":"INFO","message":"Using local tokenizer config","target":"text_generation_router","filename":"router/src/main.rs","line_number":257}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.652933Z","level":"WARN","message":"no pipeline tag found for model /repository","target":"text_generation_router","filename":"router/src/main.rs","line_number":292}
2024/04/16 10:16:59 ~ {"timestamp":"2024-04-16T08:16:59.656858Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":311}
2024/04/16 10:17:01 ~ {"timestamp":"2024-04-16T08:17:01.075402Z","level":"INFO","fields":{"message":"Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]\n"},"target":"text_generation_launcher"}
2024/04/16 10:17:02 ~ {"timestamp":"2024-04-16T08:17:02.110301Z","level":"INFO","message":"Setting max batch total tokens to 491968","target":"text_generation_router","filename":"router/src/main.rs","line_number":348}
2024/04/16 10:17:02 ~ {"timestamp":"2024-04-16T08:17:02.110318Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":349}
2024/04/16 10:17:02 ~ {"timestamp":"2024-04-16T08:17:02.110322Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":363}
It seems that the current implementation counts the tokens generated from the encoded image as part of the prompt length. It might be better to extract the image features first and then calculate the prompt token length separately. I'm not sure if TGI has support for this approach, as it could be quite involved.
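A rough illustration of the effect: tokenizing the raw base64 string as ordinary text already produces tens of thousands of tokens, which matches the ~100k-token validation error above. A small sketch, assuming transformers is installed locally (the exact count depends on the image):

import base64
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoTokenizer

# fetch the same radar-plot image used above and base64-encode it
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
buffer = BytesIO()
image.save(buffer, format="JPEG")
base64_image = base64.b64encode(buffer.getvalue()).decode("utf-8")

# count how many text tokens the base64 string alone turns into
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
print(len(tokenizer(base64_image)["input_ids"]))  # typically tens of thousands of tokens for an image of this size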
It now works for me with the latest version of TGI and with idefics2, which was recently added (https://github.com/huggingface/text-generation-inference/pull/1756).
I think we can close the issue @andrewrreed @Narsil.
In case someone needs reusable code for deploying and using idefics2 via the HF endpoints:
Create endpoint with latest TGI version:
from huggingface_hub import create_inference_endpoint
repository = "HuggingFaceM4/idefics2-8b"
endpoint_name = "idefics2-8b-00"
namespace = "MoritzLaurer"
endpoint = create_inference_endpoint(
    endpoint_name,
    repository=repository,
    namespace=namespace,
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="xlarge",  # "medium",
    instance_type="p4de",  # "g5.2xlarge",
    min_replica=0,
    max_replica=1,
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_INPUT_LENGTH": "3072",
            "MAX_TOTAL_TOKENS": "3584",
            "MODEL_ID": "/repository"
        },
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
    },
)
endpoint.wait()
print(endpoint.status)
Correctly apply the idefics2 chat template and preprocess images either as URL or as encoded strings:
import torch
from PIL import Image
import requests
import base64
from io import BytesIO
from transformers import AutoProcessor

# create the prompt and format it with the chat template
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

messages = [
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What is the difference between the two images?"},
        ]
    },
]

prompt_with_template = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt_with_template, "\n")
# format image input correctly
# you can either provide a url, or the path to local image files
images_from_url = True

if images_from_url:
    image_strings = [
        "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
        "https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg"
    ]
else:
    # load images from local directory
    image_path_lst = ["./data/image1.jpeg", "./data/image2.jpeg"]

    # encode images to strings which can be sent to the endpoint
    def encode_local_image(image_path):
        # load image
        image = Image.open(image_path)
        # Convert the image to a base64 string
        buffer = BytesIO()
        image.save(buffer, format="JPEG")  # Use the appropriate format (e.g., JPEG, PNG)
        base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
        # add string formatting required by the endpoint
        image_string = f"data:image/jpeg;base64,{base64_image}"
        return image_string

    image_strings = []
    for image_path in image_path_lst:
        image_strings.append(encode_local_image(image_path))
# insert the image url or encoded strings into the formatted prompt
def insert_image_strings_in_prompt(prompt_with_template, image_strings):
    # Create an iterator from the replacements list
    image_string_iter = iter(image_strings)
    # Replace each <image> placeholder with markdown image syntax and fill in the next image string
    prompt_with_template_and_images = prompt_with_template.replace("<image>", "![]({})").format(*image_string_iter)
    return prompt_with_template_and_images
prompt_with_images = insert_image_strings_in_prompt(prompt_with_template, image_strings)
print(prompt_with_images)
API call to active endpoint:
import huggingface_hub

API_URL = endpoint.url
headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {huggingface_hub.get_token()}",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": prompt_with_images,
    "parameters": {
        "return_full_text": False,
    }
})
output
# [{'generated_text': ' The first image shows the Statue of Liberty in New York City while the second image shows the Golden Gate Bridge in San Francisco.'}]
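For convenience, the same request can also be sent through huggingface_hub's InferenceClient instead of building the POST call manually (a sketch, reusing the endpoint and prompt_with_images from above):

from huggingface_hub import InferenceClient

# point the client at the deployed endpoint and authenticate with the local HF token
client = InferenceClient(model=endpoint.url, token=huggingface_hub.get_token())
generated_text = client.text_generation(prompt_with_images, max_new_tokens=256)
print(generated_text)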