text-generation-inference
Llava Next crashes on certain image sizes
System Info
Running in docker
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 00f365353ea5cf29438ba1d51baadaab79ae4674
Docker label: sha-00f3653
nvidia-smi:
Sat Apr 20 00:19:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:44:00.0 Off | Off |
| 30% 33C P8 21W / 300W | 45076MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
CLI Arguments
model_id: "llava-hf/llava-v1.6-34b-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4095), max_total_tokens: Some(4096), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(4096), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true
Info
{
"model_id": "llava-hf/llava-v1.6-34b-hf",
"model_sha": "5400ac92f6e1595288302ba9ab20db8542c0b8e5",
"model_dtype": "torch.float16",
"model_device_type": "cuda",
"model_pipeline_tag": "image-text-to-text",
"max_concurrent_requests": 128,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_length": 4095,
"max_total_tokens": 4096,
"waiting_served_ratio": 1.2,
"max_batch_total_tokens": 108112,
"max_waiting_tokens": 20,
"max_batch_size": null,
"validation_workers": 2,
"version": "2.0.0",
"sha": "00f365353ea5cf29438ba1d51baadaab79ae4674",
"docker_label": "sha-00f3653"
}
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Here is a script that I run on this image with the prompt "Describe the image?". Note the image is 286 × 524 pixels. It returns an error and the service crashes.
from PIL import Image
import requests
import base64
from io import BytesIO
# fetch image
image = Image.open("test2.jpeg")
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
# format image string
image_string = f"data:image/jpeg;base64,{base64_image}"
query = "Describe the image?"
prompt=f"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n\n{query}<|im_end|><|im_start|>assistant\n"
headers = {
"Accept" : "application/json",
"Content-Type": "application/json"
}
payload = {"inputs":prompt}
response = requests.post("endpoint/generate", headers=headers, json=payload)
response.json()
{'error': 'Request failed during generation: Server error: CANCELLED',
'error_type': 'generation'}
Logs from the tgi service
tgi-llava-1 | 2024-04-20T00:13:55.522584Z ERROR text_generation_launcher: Method Prefill encountered an error.
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llava-1 | sys.exit(app())
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llava-1 | return get_command(self)(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llava-1 | return self.main(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llava-1 | return _main(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llava-1 | rv = self.invoke(ctx)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llava-1 | return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llava-1 | return ctx.invoke(self.callback, **ctx.params)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llava-1 | return __callback(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llava-1 | return callback(**use_params) # type: ignore
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-llava-1 | server.serve(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
tgi-llava-1 | asyncio.run(
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llava-1 | return loop.run_until_complete(main)
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llava-1 | self.run_forever()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llava-1 | self._run_once()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llava-1 | handle._run()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llava-1 | self._context.run(self._callback, *self._args)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llava-1 | return await self.intercept(
tgi-llava-1 | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llava-1 | return await response
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
tgi-llava-1 | raise error
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
tgi-llava-1 | return await behavior(request_or_iterator, context)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Prefill
tgi-llava-1 | generations, next_batch, timings = self.model.generate_token(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llava-1 | return func(*args, **kwds)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
tgi-llava-1 | raise e
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
tgi-llava-1 | out, speculative_logits = self.forward(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 285, in forward
tgi-llava-1 | logits, speculative_logits = self.model.forward(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 283, in forward
tgi-llava-1 | inputs_embeds = self._merge_input_ids_with_image_features(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 183, in _merge_input_ids_with_image_features
tgi-llava-1 | inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
tgi-llava-1 | RuntimeError: shape mismatch: value tensor of shape [1676, 7168] cannot be broadcast to indexing result of shape [2781, 7168]
tgi-llava-1 |
tgi-llava-1 | 2024-04-20T00:13:55.879011Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
tgi-llava-1 | 2024-04-20T00:13:56.509685Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
tgi-llava-1 | 2024-04-20T00:13:56.509723Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.9), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.6), typical_p: None, do_sample: true, max_new_tokens: Some(3704), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
tgi-llava-1 | 2024-04-20T00:13:56.601488Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
tgi-llava-1 |
tgi-llava-1 | You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
tgi-llava-1 | Exception ignored in: <function Server.__del__ at 0x7f73512317e0>
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 185, in __del__
tgi-llava-1 | cygrpc.schedule_coro_threadsafe(
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
tgi-llava-1 | self._check_closed()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
tgi-llava-1 | raise RuntimeError('Event loop is closed')
tgi-llava-1 | RuntimeError: Event loop is closed
tgi-llava-1 | sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
tgi-llava-1 | Task exception was never retrieved
tgi-llava-1 | future: <Task finished name='Task-12' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llava-1 | return await response
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
tgi-llava-1 | raise error
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
tgi-llava-1 | return await behavior(request_or_iterator, context)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Prefill
tgi-llava-1 | generations, next_batch, timings = self.model.generate_token(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llava-1 | return func(*args, **kwds)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
tgi-llava-1 | raise e
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
tgi-llava-1 | out, speculative_logits = self.forward(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 285, in forward
tgi-llava-1 | logits, speculative_logits = self.model.forward(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 283, in forward
tgi-llava-1 | inputs_embeds = self._merge_input_ids_with_image_features(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 183, in _merge_input_ids_with_image_features
tgi-llava-1 | inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
tgi-llava-1 | RuntimeError: shape mismatch: value tensor of shape [1676, 7168] cannot be broadcast to indexing result of shape [2781, 7168]
tgi-llava-1 |
tgi-llava-1 | During handling of the above exception, another exception occurred:
tgi-llava-1 |
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llava-1 | return get_command(self)(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llava-1 | return self.main(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llava-1 | return _main(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llava-1 | rv = self.invoke(ctx)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llava-1 | return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llava-1 | return ctx.invoke(self.callback, **ctx.params)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llava-1 | return __callback(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llava-1 | return callback(**use_params) # type: ignore
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-llava-1 | server.serve(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
tgi-llava-1 | asyncio.run(
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llava-1 | return loop.run_until_complete(main)
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llava-1 | self.run_forever()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llava-1 | self._run_once()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llava-1 | handle._run()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llava-1 | self._context.run(self._callback, *self._args)
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llava-1 | return await self.intercept(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
tgi-llava-1 | exit(1)
tgi-llava-1 | File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
tgi-llava-1 | raise SystemExit(code)
tgi-llava-1 | SystemExit: 1 rank=0
tgi-llava-1 | 2024-04-20T00:13:56.632158Z ERROR text_generation_launcher: Shard 0 crashed
tgi-llava-1 | 2024-04-20T00:13:56.632178Z INFO text_generation_launcher: Terminating webserver
tgi-llava-1 | 2024-04-20T00:13:56.632196Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
tgi-llava-1 | 2024-04-20T00:13:56.632405Z INFO text_generation_router::server: router/src/server.rs:1504: signal received, starting graceful shutdown
tgi-llava-1 | 2024-04-20T00:13:56.732331Z INFO text_generation_launcher: webserver terminated
tgi-llava-1 | 2024-04-20T00:13:56.732350Z INFO text_generation_launcher: Shutting down shards
tgi-llava-1 | Error: ShardFailed
tgi-llava-1 exited with code 1
Expected behavior
When I run the same script on an image that's square (554x554), it behaves as expected.
Response
{'generated_text': "The image shows a young dog with a mix of black and brown fur. It has a curious expression, with wide, dark eyes that are turned towards the camera and a slightly tilted head, suggesting attentiveness. The dog's fur appears soft and shiny, and it has a white area on its muzzle and underbelly, which is common in many dog breeds. The background is a plain light color, providing a stark contrast to the dog's dark fur and highlighting its features. The"}
Logs from tgi
tgi-llava-1 | 2024-04-20T01:42:05.198186Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="7.207770204s" validation_time="74.92µs" queue_time="39.85µs" inference_time="7.207655694s" time_per_token="72.076556ms" seed="Some(12891300100484859231)"}: text_generation_router::server: router/src/server.rs:310: Success
Sometimes it works with landscape images of certain sizes; sometimes it also crashes. Do image sizes have to be multiples of 336?
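Until this is resolved, a possible client-side mitigation, based purely on the observation in this thread that square images work, is to pad images to a square before encoding them. A minimal PIL sketch (the white background fill is an arbitrary choice, not anything TGI requires):
from PIL import Image

def pad_to_square(image: Image.Image, fill=(255, 255, 255)) -> Image.Image:
    """Pad an image to a square canvas, keeping the original content centered.

    Client-side mitigation only, based on the observation that square images
    do not trigger the shape mismatch in this thread.
    """
    width, height = image.size
    side = max(width, height)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(image, ((side - width) // 2, (side - height) // 2))
    return canvas

# Example: pad the 286 x 524 image from the reproduction script before encoding it.
image = pad_to_square(Image.open("test2.jpeg").convert("RGB"))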
Same problem: "Method Prefill encountered an error".
It seems that the current implementation counts the tokens generated from the encoded image as part of the prompt length. It might be better to extract the image features first and then calculate the prompt token length separately. I'm not sure if TGI has support for this approach, as it could be quite involved.
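For reference, here is a rough, self-contained sketch (not TGI's actual code) of how the LLaVA-NeXT reference logic in transformers derives the number of image features from the raw image size. The grid pinpoints and constants below are taken from the llava-v1.6 configs and are assumptions for other checkpoints; the integer truncation in the "unpad" step is one place where a server-side placeholder count and the real feature count can drift apart for non-square images.
GRID_PINPOINTS = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]  # (height, width)
TILE_SIZE = 336           # vision tower input resolution
FEATURES_PER_SIDE = 24    # 336 / 14 (ViT patch size), i.e. 24 * 24 = 576 features per tile

def select_best_resolution(height: int, width: int) -> tuple[int, int]:
    """Pick the grid resolution that keeps the most usable pixels, then wastes the least."""
    best, best_score = None, None
    for target_h, target_w in GRID_PINPOINTS:
        scale = min(target_w / width, target_h / height)
        effective = min(int(width * scale) * int(height * scale), width * height)
        wasted = target_h * target_w - effective
        score = (effective, -wasted)
        if best_score is None or score > best_score:
            best, best_score = (target_h, target_w), score
    return best

def expected_image_features(height: int, width: int) -> int:
    """Approximate number of image feature vectors the model produces for one image."""
    target_h, target_w = select_best_resolution(height, width)
    grid_h = (target_h // TILE_SIZE) * FEATURES_PER_SIDE
    grid_w = (target_w // TILE_SIZE) * FEATURES_PER_SIDE
    # "Unpad" the feature grid back toward the original aspect ratio. The int()
    # truncation here is where off-by-a-few mismatches can creep in.
    if width / height > grid_w / grid_h:
        new_h = int(height * (grid_w / width))
        pad = (grid_h - new_h) // 2
        grid_h -= 2 * pad
    else:
        new_w = int(width * (grid_h / height))
        pad = (grid_w - new_w) // 2
        grid_w -= 2 * pad
    unpadded = grid_h * grid_w        # features for the high-res tiles after unpadding
    newlines = grid_h                 # one image_newline embedding per row
    base = FEATURES_PER_SIDE ** 2     # 576 features for the downscaled base image
    return unpadded + newlines + base

print(expected_image_features(height=524, width=286))  # -> 1676 for the 286 x 524 image above
For the 286 × 524 image from the original report this returns 1676, which matches the value-tensor side of the shape-mismatch error ([1676, 7168]), while the prompt expansion apparently reserved 2781 placeholder slots.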
Same issue; only images with width == height work.
I have the same issue; it seems to be linked to image sizes. I found that some sizes work in TGI v2.0.1 but not in TGI v2.0.2, and vice versa.
Here is a recap of the image sizes I tested. Note that image 2-bis is image 2 cropped, to confirm that the dimensions are what cause the issue.
| Image | dimension | ratio L/W | works in v2.0.1 | works in v2.0.2 |
|---|---|---|---|---|
| 1 | 450 x 299 | 1.505 | No | Yes |
| 2 | 800 x 531 | 1.506 | Yes | No |
| 2 bis | 450 x 299 | 1.505 | No | Yes |
| 3 | 300 x 168 | 1.785 | No | Yes |
| 4 | 640 x 480 | 1.333 | Yes | Yes |
| 5 | 934 x 934 (square) | 1 | Yes | Yes |
When the image doesn't have the right dimensions, the server encounters an error and crashes. Here are the logs I get:
v2.0.1 (image 1 crash)
ERROR text_generation_launcher: Method Prefill encountered an error.
...
RuntimeError: shape mismatch: value tensor of shape [1464, 4096] cannot be broadcast to indexing result of shape [1376, 4096]
...
ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
...
ERROR text_generation_launcher: Shard 0 crashed
v2.0.2 (image 2 crash, not happening at warmup)
INFO text_generation_launcher: Found 2095 in image of resolution 531x800
ERROR text_generation_launcher: Method Prefill encountered an error.
...
RuntimeError: shape mismatch: value tensor of shape [2144, 4096] cannot be broadcast to indexing result of shape [2095, 4096]
...
RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens` to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [2144, 4096] cannot be broadcast to indexing result of shape [2095, 4096]
...
ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
...
ERROR text_generation_launcher: Shard 0 crashed
My model info
{
model_id: "llava-hf/llava-v1.6-mistral-7b-hf",
validation_workers: 2,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(4000),
max_total_tokens: Some(5000),
waiting_served_ratio: 0.3,
max_waiting_tokens: 20,
hostname: "0.0.0.0",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some("/data"),
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
json_output: false,
cors_allow_origin: [],
ngrok: false,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
Experiencing crashes too
cURL:
curl localhost:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "user",
"content": "Whats in this image?\n"
}
],
"stream": false,
"max_tokens": 20,
"seed": 42
}'
Docker
docker run --rm --name tgi --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=xxx -p 8080:80 -v /opt/tgi-cache:/data ghcr.io/huggingface/text-generation-inference:latest --model-id llava-hf/llava-v1.6-mistral-7b-hf
Nvidia
nvidia-smi
Mon May 13 15:22:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:09:00.0 On | Off |
| 0% 44C P5 39W / 450W | 1175MiB / 24564MiB | 13% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Logs
2024-05-13T18:21:26.554739Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 158, in _merge_input_ids_with_image_features
inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
RuntimeError: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 139, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 948, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
out, speculative_logits = self.forward(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 326, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 264, in forward
inputs_embeds = self._merge_input_ids_with_image_features(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 160, in _merge_input_ids_with_image_features
raise RuntimeError(
RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens` to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
2024-05-13T18:21:26.624501Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
2024-05-13T18:21:26.957829Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(1)}:clear_cache{batch_id=Some(1)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
2024-05-13T18:21:26.957851Z ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
2024-05-13T18:21:26.994267Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
warnings.warn(
Exception ignored in: <function Server.__del__ at 0x74db51d7e4d0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 186, in __del__
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
future: <Task finished name='Task-52' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 158, in _merge_input_ids_with_image_features
inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
RuntimeError: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 139, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 948, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
out, speculative_logits = self.forward(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 326, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 264, in forward
inputs_embeds = self._merge_input_ids_with_image_features(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 160, in _merge_input_ids_with_image_features
raise RuntimeError(
RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens` to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
exit(1)
File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 1 rank=0
2024-05-13T18:21:26.996139Z ERROR text_generation_launcher: Shard 0 crashed
2024-05-13T18:21:26.996144Z INFO text_generation_launcher: Terminating webserver
2024-05-13T18:21:26.996150Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-05-13T18:21:26.996183Z INFO text_generation_router::server: router/src/server.rs:1740: signal received, starting graceful shutdown
2024-05-13T18:21:27.096237Z INFO text_generation_launcher: webserver terminated
2024-05-13T18:21:27.096244Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
cc @Narsil, any idea how hard a fix would be here?
We're considering moving to TGI for our Llava-Next traffic, but the entire Docker container crashes and stops on the very first image we tried.
Same issue here. Any fix yet? @Narsil
Hit the same issue with the idefics2 model. It crashed hard.
2024-05-26T17:59:36.868147Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 131, in Prefill
batch = self.model.batch_type.from_pb_processor(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 235, in from_pb_processor
batch_tokenized_inputs, image_inputs = cls.batch_tokenized_inputs(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 187, in batch_tokenized_inputs
raise RuntimeError(
RuntimeError: Cannot process input image not starting with data:
2024-05-26T17:59:36.974091Z ERROR batch{batch_size=1}:prefill:prefill{id=587 size…
2024-05-26T17:59:37.429699Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(587)…
2024-05-26T17:59:37.429718Z ERROR compat_generate{default_return_full_text=false compute_type=Extension(ComputeType("1-nvidia-h100-80gb-hbm3"))}:…
2024-05-26T17:59:37.594424Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of Py>
warnings.warn(
Exception ignored in: <function Server.__del__ at 0x7b0132b04550>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 186, in __del__
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
future: <Task finished name='Task-253498' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 131, in Prefill
batch = self.model.batch_type.from_pb_processor(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 235, in from_pb_processor
batch_tokenized_inputs, image_inputs = cls.batch_tokenized_inputs(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 187, in batch_tokenized_inputs
raise RuntimeError(
RuntimeError: Cannot process input image not starting with data:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
exit(1)
File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 1 rank=0
2024-05-26T17:59:37.612016Z ERROR text_generation_launcher: Shard 0 crashed
2024-05-26T17:59:37.612029Z INFO text_generation_launcher: Terminating webserver
2024-05-26T17:59:37.612040Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-05-26T17:59:37.612090Z INFO text_generation_router::server: router/src/server.rs:1740: signal received, starting graceful shutdown
2024-05-26T17:59:37.712140Z INFO text_generation_launcher: webserver terminated
2024-05-26T17:59:37.712145Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
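The "Cannot process input image not starting with data:" error in this trace means the shard received an image reference that does not start with data:, which looks like a different failure than the size-dependent shape mismatch above. For what it's worth, here is a minimal sketch of a request that inlines the image as a data: URI; the endpoint URL, file name, prompt text, and parameters are placeholders, and the ![](...) markdown syntax follows the TGI VLM examples:
# Hypothetical sketch: build a /generate request whose image is inlined as a data: URI.
import base64
import requests

with open("example.jpeg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": f"![]({data_uri})What is in this image?\n\n",
    "parameters": {"max_new_tokens": 64},
}
response = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
print(response.json())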
Unsure whether this is about image sizes; it may be something else. It was one request out of many hundreds.
@pseudotensor Is this with the latest version?
No, 2.0.3. I will try 2.0.4, thanks!
It still needs to be rebased and reviewed, but this should be fixed by PR #2097 if anyone wants to try it.
This should be fixed by #2080 (#2097 became part of that PR).