text-generation-inference
Llava Next crashes on certain image sizes
System Info
Running in docker
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 00f365353ea5cf29438ba1d51baadaab79ae4674
Docker label: sha-00f3653
nvidia-smi:
Sat Apr 20 00:19:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:44:00.0 Off | Off |
| 30% 33C P8 21W / 300W | 45076MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
CLI Arguments
model_id: "llava-hf/llava-v1.6-34b-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4095), max_total_tokens: Some(4096), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(4096), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true
Info
{
"model_id": "llava-hf/llava-v1.6-34b-hf",
"model_sha": "5400ac92f6e1595288302ba9ab20db8542c0b8e5",
"model_dtype": "torch.float16",
"model_device_type": "cuda",
"model_pipeline_tag": "image-text-to-text",
"max_concurrent_requests": 128,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_length": 4095,
"max_total_tokens": 4096,
"waiting_served_ratio": 1.2,
"max_batch_total_tokens": 108112,
"max_waiting_tokens": 20,
"max_batch_size": null,
"validation_workers": 2,
"version": "2.0.0",
"sha": "00f365353ea5cf29438ba1d51baadaab79ae4674",
"docker_label": "sha-00f3653"
}
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Here is a script that I run on this image with the prompt "Describe the image?". Note the image is 286 × 524 pixels. It returns an error and the service crashes.
from PIL import Image
import requests
import base64
from io import BytesIO
# fetch image
image = Image.open("test2.jpeg")
# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG") # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
# format image string
image_string = f"data:image/jpeg;base64,{base64_image}"
query = "Describe the image?"
prompt=f"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n\n{query}<|im_end|><|im_start|>assistant\n"
headers = {
"Accept" : "application/json",
"Content-Type": "application/json"
}
payload = {"inputs":prompt}
response = requests.post("endpoint/generate", headers=headers, json=payload)
response.json()
{'error': 'Request failed during generation: Server error: CANCELLED',
'error_type': 'generation'}
Logs from the tgi service
tgi-llava-1 | 2024-04-20T00:13:55.522584Z ERROR text_generation_launcher: Method Prefill encountered an error.
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llava-1 | sys.exit(app())
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llava-1 | return get_command(self)(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llava-1 | return self.main(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llava-1 | return _main(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llava-1 | rv = self.invoke(ctx)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llava-1 | return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llava-1 | return ctx.invoke(self.callback, **ctx.params)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llava-1 | return __callback(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llava-1 | return callback(**use_params) # type: ignore
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-llava-1 | server.serve(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
tgi-llava-1 | asyncio.run(
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llava-1 | return loop.run_until_complete(main)
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llava-1 | self.run_forever()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llava-1 | self._run_once()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llava-1 | handle._run()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llava-1 | self._context.run(self._callback, *self._args)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llava-1 | return await self.intercept(
tgi-llava-1 | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llava-1 | return await response
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
tgi-llava-1 | raise error
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
tgi-llava-1 | return await behavior(request_or_iterator, context)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Prefill
tgi-llava-1 | generations, next_batch, timings = self.model.generate_token(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llava-1 | return func(*args, **kwds)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
tgi-llava-1 | raise e
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
tgi-llava-1 | out, speculative_logits = self.forward(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 285, in forward
tgi-llava-1 | logits, speculative_logits = self.model.forward(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 283, in forward
tgi-llava-1 | inputs_embeds = self._merge_input_ids_with_image_features(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 183, in _merge_input_ids_with_image_features
tgi-llava-1 | inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
tgi-llava-1 | RuntimeError: shape mismatch: value tensor of shape [1676, 7168] cannot be broadcast to indexing result of shape [2781, 7168]
tgi-llava-1 |
tgi-llava-1 | 2024-04-20T00:13:55.879011Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
tgi-llava-1 | 2024-04-20T00:13:56.509685Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
tgi-llava-1 | 2024-04-20T00:13:56.509723Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.9), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.6), typical_p: None, do_sample: true, max_new_tokens: Some(3704), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
tgi-llava-1 | 2024-04-20T00:13:56.601488Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
tgi-llava-1 |
tgi-llava-1 | You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
tgi-llava-1 | Exception ignored in: <function Server.__del__ at 0x7f73512317e0>
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 185, in __del__
tgi-llava-1 | cygrpc.schedule_coro_threadsafe(
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
tgi-llava-1 | self._check_closed()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
tgi-llava-1 | raise RuntimeError('Event loop is closed')
tgi-llava-1 | RuntimeError: Event loop is closed
tgi-llava-1 | sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
tgi-llava-1 | Task exception was never retrieved
tgi-llava-1 | future: <Task finished name='Task-12' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llava-1 | return await response
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
tgi-llava-1 | raise error
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
tgi-llava-1 | return await behavior(request_or_iterator, context)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Prefill
tgi-llava-1 | generations, next_batch, timings = self.model.generate_token(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llava-1 | return func(*args, **kwds)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
tgi-llava-1 | raise e
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
tgi-llava-1 | out, speculative_logits = self.forward(batch)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 285, in forward
tgi-llava-1 | logits, speculative_logits = self.model.forward(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 283, in forward
tgi-llava-1 | inputs_embeds = self._merge_input_ids_with_image_features(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 183, in _merge_input_ids_with_image_features
tgi-llava-1 | inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
tgi-llava-1 | RuntimeError: shape mismatch: value tensor of shape [1676, 7168] cannot be broadcast to indexing result of shape [2781, 7168]
tgi-llava-1 |
tgi-llava-1 | During handling of the above exception, another exception occurred:
tgi-llava-1 |
tgi-llava-1 | Traceback (most recent call last):
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llava-1 | return get_command(self)(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llava-1 | return self.main(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llava-1 | return _main(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llava-1 | rv = self.invoke(ctx)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llava-1 | return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llava-1 | return ctx.invoke(self.callback, **ctx.params)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llava-1 | return __callback(*args, **kwargs)
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llava-1 | return callback(**use_params) # type: ignore
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-llava-1 | server.serve(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
tgi-llava-1 | asyncio.run(
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llava-1 | return loop.run_until_complete(main)
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llava-1 | self.run_forever()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llava-1 | self._run_once()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llava-1 | handle._run()
tgi-llava-1 | File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llava-1 | self._context.run(self._callback, *self._args)
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
tgi-llava-1 | File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llava-1 | return await self.intercept(
tgi-llava-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
tgi-llava-1 | exit(1)
tgi-llava-1 | File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
tgi-llava-1 | raise SystemExit(code)
tgi-llava-1 | SystemExit: 1 rank=0
tgi-llava-1 | 2024-04-20T00:13:56.632158Z ERROR text_generation_launcher: Shard 0 crashed
tgi-llava-1 | 2024-04-20T00:13:56.632178Z INFO text_generation_launcher: Terminating webserver
tgi-llava-1 | 2024-04-20T00:13:56.632196Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
tgi-llava-1 | 2024-04-20T00:13:56.632405Z INFO text_generation_router::server: router/src/server.rs:1504: signal received, starting graceful shutdown
tgi-llava-1 | 2024-04-20T00:13:56.732331Z INFO text_generation_launcher: webserver terminated
tgi-llava-1 | 2024-04-20T00:13:56.732350Z INFO text_generation_launcher: Shutting down shards
tgi-llava-1 | Error: ShardFailed
tgi-llava-1 exited with code 1
Expected behavior
When I run the same script on an image that's square (554x554), it behaves as expected.
Response
{'generated_text': "The image shows a young dog with a mix of black and brown fur. It has a curious expression, with wide, dark eyes that are turned towards the camera and a slightly tilted head, suggesting attentiveness. The dog's fur appears soft and shiny, and it has a white area on its muzzle and underbelly, which is common in many dog breeds. The background is a plain light color, providing a stark contrast to the dog's dark fur and highlighting its features. The"}
Logs from tgi
tgi-llava-1 | 2024-04-20T01:42:05.198186Z INFO generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="7.207770204s" validation_time="74.92µs" queue_time="39.85µs" inference_time="7.207655694s" time_per_token="72.076556ms" seed="Some(12891300100484859231)"}: text_generation_router::server: router/src/server.rs:310: Success
Sometimes it works with landscape images of certain sizes; sometimes it also crashes. Do image sizes have to be multiples of 336?
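Until this is resolved, a possible client-side mitigation, based purely on the observation in this thread that square images work, is to pad images to a square before encoding them. A minimal PIL sketch (the white background fill is an arbitrary choice, not anything TGI requires):
from PIL import Image

def pad_to_square(image: Image.Image, fill=(255, 255, 255)) -> Image.Image:
    """Pad an image to a square canvas, keeping the original content centered.

    Client-side mitigation only, based on the observation that square images
    do not trigger the shape mismatch in this thread.
    """
    width, height = image.size
    side = max(width, height)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(image, ((side - width) // 2, (side - height) // 2))
    return canvas

# Example: pad the 286 x 524 image from the reproduction script before encoding it.
image = pad_to_square(Image.open("test2.jpeg").convert("RGB"))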
Same problem: "Method Prefill encountered an error".
It seems that the current implementation counts the tokens generated from the encoded image as part of the prompt length. It might be better to extract the image features first and then calculate the prompt token length separately. I'm not sure if TGI has support for this approach, as it could be quite involved.
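For reference, here is a rough, self-contained sketch (not TGI's actual code) of how the LLaVA-NeXT reference logic in transformers derives the number of image features from the raw image size. The grid pinpoints and constants below are taken from the llava-v1.6 configs and are assumptions for other checkpoints; the integer truncation in the "unpad" step is one place where a server-side placeholder count and the real feature count can drift apart for non-square images.
GRID_PINPOINTS = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]  # (height, width)
TILE_SIZE = 336           # vision tower input resolution
FEATURES_PER_SIDE = 24    # 336 / 14 (ViT patch size), i.e. 24 * 24 = 576 features per tile

def select_best_resolution(height: int, width: int) -> tuple[int, int]:
    """Pick the grid resolution that keeps the most usable pixels, then wastes the least."""
    best, best_score = None, None
    for target_h, target_w in GRID_PINPOINTS:
        scale = min(target_w / width, target_h / height)
        effective = min(int(width * scale) * int(height * scale), width * height)
        wasted = target_h * target_w - effective
        score = (effective, -wasted)
        if best_score is None or score > best_score:
            best, best_score = (target_h, target_w), score
    return best

def expected_image_features(height: int, width: int) -> int:
    """Approximate number of image feature vectors the model produces for one image."""
    target_h, target_w = select_best_resolution(height, width)
    grid_h = (target_h // TILE_SIZE) * FEATURES_PER_SIDE
    grid_w = (target_w // TILE_SIZE) * FEATURES_PER_SIDE
    # "Unpad" the feature grid back toward the original aspect ratio. The int()
    # truncation here is where off-by-a-few mismatches can creep in.
    if width / height > grid_w / grid_h:
        new_h = int(height * (grid_w / width))
        pad = (grid_h - new_h) // 2
        grid_h -= 2 * pad
    else:
        new_w = int(width * (grid_h / height))
        pad = (grid_w - new_w) // 2
        grid_w -= 2 * pad
    unpadded = grid_h * grid_w        # features for the high-res tiles after unpadding
    newlines = grid_h                 # one image_newline embedding per row
    base = FEATURES_PER_SIDE ** 2     # 576 features for the downscaled base image
    return unpadded + newlines + base

print(expected_image_features(height=524, width=286))  # -> 1676 for the 286 x 524 image above
For the 286 × 524 image from the original report this returns 1676, which matches the value-tensor side of the shape-mismatch error ([1676, 7168]), while the prompt expansion apparently reserved 2781 placeholder slots.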
Same issue; only images with width == height work.
I have the same issue; it seems to be linked to image sizes. I found that some sizes work in TGI v2.0.1 but not in TGI v2.0.2, and vice versa.
Here is a recap of the image sizes I tested. Note that image 2-bis is image 2 cropped, to confirm that the dimensions are what cause the issue.
| Image | dimension | ratio L/W | works in v2.0.1 | works in v2.0.2 |
|---|---|---|---|---|
| 1 | 450 x 299 | 1.505 | No | Yes |
| 2 | 800 x 531 | 1.506 | Yes | No |
| 2 bis | 450 x 299 | 1.505 | No | Yes |
| 3 | 300 x 168 | 1.785 | No | Yes |
| 4 | 640 x 480 | 1.333 | Yes | Yes |
| 5 | 934 x 934 (square) | 1 | Yes | Yes |
When the image doesn't have the right dimensions, the server encounters an error and crashes. Here are the logs I get:
v2.0.1 (image 1 crash)
ERROR text_generation_launcher: Method Prefill encountered an error.
...
RuntimeError: shape mismatch: value tensor of shape [1464, 4096] cannot be broadcast to indexing result of shape [1376, 4096]
...
ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
...
ERROR text_generation_launcher: Shard 0 crashed
v2.0.2 (image 2 crash, not happening at warmup)
INFO text_generation_launcher: Found 2095 in image of resolution 531x800
ERROR text_generation_launcher: Method Prefill encountered an error.
...
RuntimeError: shape mismatch: value tensor of shape [2144, 4096] cannot be broadcast to indexing result of shape [2095, 4096]
...
RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens` to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [2144, 4096] cannot be broadcast to indexing result of shape [2095, 4096]
...
ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
...
ERROR text_generation_launcher: Shard 0 crashed
My model info
{
model_id: "llava-hf/llava-v1.6-mistral-7b-hf",
validation_workers: 2,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(4000),
max_total_tokens: Some(5000),
waiting_served_ratio: 0.3,
max_waiting_tokens: 20,
hostname: "0.0.0.0",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some("/data"),
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
json_output: false,
cors_allow_origin: [],
ngrok: false,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
Experiencing crashes too
cURL:
curl localhost:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "user",
"content": "Whats in this image?\n"
}
],
"stream": false,
"max_tokens": 20,
"seed": 42
}'
Docker
docker run --rm --name tgi --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=xxx -p 8080:80 -v /opt/tgi-cache:/data ghcr.io/huggingface/text-generation-inference:latest --model-id llava-hf/llava-v1.6-mistral-7b-hf
Nvidia
nvidia-smi
Mon May 13 15:22:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:09:00.0 On | Off |
| 0% 44C P5 39W / 450W | 1175MiB / 24564MiB | 13% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Logs
2024-05-13T18:21:26.554739Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 158, in _merge_input_ids_with_image_features
inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
RuntimeError: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 139, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 948, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
out, speculative_logits = self.forward(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 326, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 264, in forward
inputs_embeds = self._merge_input_ids_with_image_features(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 160, in _merge_input_ids_with_image_features
raise RuntimeError(
RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens` to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
2024-05-13T18:21:26.624501Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
2024-05-13T18:21:26.957829Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(1)}:clear_cache{batch_id=Some(1)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
2024-05-13T18:21:26.957851Z ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
2024-05-13T18:21:26.994267Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
warnings.warn(
Exception ignored in: <function Server.__del__ at 0x74db51d7e4d0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 186, in __del__
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
future: <Task finished name='Task-52' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 158, in _merge_input_ids_with_image_features
inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
RuntimeError: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 139, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 948, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
out, speculative_logits = self.forward(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 326, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 264, in forward
inputs_embeds = self._merge_input_ids_with_image_features(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 160, in _merge_input_ids_with_image_features
raise RuntimeError(
RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens` to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [2256, 4096] cannot be broadcast to indexing result of shape [2208, 4096]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
exit(1)
File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 1 rank=0
2024-05-13T18:21:26.996139Z ERROR text_generation_launcher: Shard 0 crashed
2024-05-13T18:21:26.996144Z INFO text_generation_launcher: Terminating webserver
2024-05-13T18:21:26.996150Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-05-13T18:21:26.996183Z INFO text_generation_router::server: router/src/server.rs:1740: signal received, starting graceful shutdown
2024-05-13T18:21:27.096237Z INFO text_generation_launcher: webserver terminated
2024-05-13T18:21:27.096244Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
cc @Narsil, any idea how hard a fix would be here?
We're considering moving to TGI for our Llava-Next traffic, but the entire Docker container crashes and stops on the very first image we tried.
Same issue here. Any fix yet? @Narsil
Hit the same issue with the idefics2 model. It crashed hard.
2024-05-26T17:59:36.868147Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 131, in Prefill
batch = self.model.batch_type.from_pb_processor(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 235, in from_pb_processor
batch_tokenized_inputs, image_inputs = cls.batch_tokenized_inputs(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 187, in batch_tokenized_inputs
raise RuntimeError(
RuntimeError: Cannot process input image not starting with data:
2024-05-26T17:59:36.974091Z ERROR batch{batch_size=1}:prefill:prefill{id=587 size…
2024-05-26T17:59:37.429699Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(587)…
2024-05-26T17:59:37.429718Z ERROR compat_generate{default_return_full_text=false compute_type=Extension(ComputeType("1-nvidia-h100-80gb-hbm3"))}:…
2024-05-26T17:59:37.594424Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of Py>
warnings.warn(
Exception ignored in: <function Server.__del__ at 0x7b0132b04550>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 186, in __del__
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
future: <Task finished name='Task-253498' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 131, in Prefill
batch = self.model.batch_type.from_pb_processor(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 235, in from_pb_processor
batch_tokenized_inputs, image_inputs = cls.batch_tokenized_inputs(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 187, in batch_tokenized_inputs
raise RuntimeError(
RuntimeError: Cannot process input image not starting with data:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
exit(1)
File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 1 rank=0
2024-05-26T17:59:37.612016Z ERROR text_generation_launcher: Shard 0 crashed
2024-05-26T17:59:37.612029Z INFO text_generation_launcher: Terminating webserver
2024-05-26T17:59:37.612040Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-05-26T17:59:37.612090Z INFO text_generation_router::server: router/src/server.rs:1740: signal received, starting graceful shutdown
2024-05-26T17:59:37.712140Z INFO text_generation_launcher: webserver terminated
2024-05-26T17:59:37.712145Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
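The "Cannot process input image not starting with data:" error in this trace means the shard received an image reference that does not start with data:, which looks like a different failure than the size-dependent shape mismatch above. For what it's worth, here is a minimal sketch of a request that inlines the image as a data: URI; the endpoint URL, file name, prompt text, and parameters are placeholders, and the ![](...) markdown syntax follows the TGI VLM examples:
# Hypothetical sketch: build a /generate request whose image is inlined as a data: URI.
import base64
import requests

with open("example.jpeg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": f"![]({data_uri})What is in this image?\n\n",
    "parameters": {"max_new_tokens": 64},
}
response = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
print(response.json())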
Unsure whether this is about image sizes; it may be something else. It was one request out of many hundreds.
@pseudotensor Is this with the latest version?
No, 2.0.3. I will try 2.0.4, thanks!
It still needs to be rebased and reviewed, but this should be fixed by PR #2097 if anyone wants to try it.
This should be fixed by #2080 (#2097 became part of that PR).