text-generation-inference

                        Large Language Model Text Generation Inference
### System Info Exact command used to run TGI: `docker run --gpus all --shm-size 1g -p 5000:80 -v /mnt/disk/models/llama-3.3-70b-instruct-awq:/usr/src/llama-3.3-70b -it ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id llama-3.3-70b --quantize awq --cuda-memory-fraction 1 --sharded true --num-shard...
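For reference, a minimal client sketch against a container started roughly as above, assuming the server is reachable on localhost:5000 per the `-p 5000:80` mapping; the prompt and `max_new_tokens` value are illustrative, not from the issue:

```python
from huggingface_hub import InferenceClient

# Assumes the TGI container above is running and mapped to localhost:5000.
client = InferenceClient("http://localhost:5000")

# Illustrative request; max_new_tokens is an arbitrary example value.
output = client.text_generation(
    "What is AWQ quantization?",
    max_new_tokens=128,
)
print(output)
```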
### System Info text-generation-launcher 3.1.1-dev0 Single RTX 4070 S GPU NVIDIA-SMI 572.16 Driver Version: 572.16 CUDA Version: 12.8 Models Used: meta-llama/Llama-3.1-8B-Instruct, Yujivus/DeepSeek-R1-Distill-Llama-8B-AWQ, Yujivus/Phi-4-Health-CoT-1.1-AWQ Docker Command: docker run --name tgi-server...
### System Info Hi team, We are trying to get the default parameter values that are used when invoking a fine-tuned model deployed with TGI (latest version)....
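One way to inspect the settings a running TGI server applies is its GET `/info` endpoint; a rough sketch, where the host and port are placeholder assumptions for your deployment:

```python
import requests

# Placeholder endpoint; substitute your deployment's host and port.
info = requests.get("http://localhost:8080/info", timeout=10).json()

# /info reports the model id plus server-side limits such as
# max_input_tokens and max_total_tokens.
for key, value in info.items():
    print(f"{key}: {value}")
```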
Avoid leaking token and cache url
### Model description Mistral's recent commit uses this template ``` "chat_template": "{%- set today = strftime_now(\"%Y-%m-%d\") %}\n{%- set default_system_message = \"You are Mistral Small 3, a Large Language Model (LLM)...
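The template calls a `strftime_now` helper that plain Jinja does not provide. A minimal sketch of how a renderer can expose such a helper, assuming the `jinja2` package; the helper's exact semantics in TGI are an assumption here:

```python
from datetime import datetime
from jinja2 import Environment

def strftime_now(fmt: str) -> str:
    # Assumed behavior: format the current local time, as the template expects.
    return datetime.now().strftime(fmt)

env = Environment()
env.globals["strftime_now"] = strftime_now

template = env.from_string(
    '{%- set today = strftime_now("%Y-%m-%d") %}Today is {{ today }}.'
)
print(template.render())
```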
# What does this PR do? The TGI server fails to start due to missing Python headers during the compilation of Triton indexing kernels. The solution is to change the...
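Independently of whatever fix the PR lands, a quick diagnostic sketch for checking whether the Python development headers the Triton build needs are present in an image:

```python
import os
import sysconfig

# Triton's kernel compilation needs Python.h; this checks whether the
# interpreter's include directory actually ships it.
include_dir = sysconfig.get_paths()["include"]
header = os.path.join(include_dir, "Python.h")
print(f"{header}: {'found' if os.path.exists(header) else 'missing'}")
```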
### System Info Docker container: `ghcr.io/huggingface/text-generation-inference:3.0.0` ### Information - [x] Docker - [ ] The CLI directly ### Tasks - [x] An officially supported command - [ ] My own...
### System Info TGI versions 3.0.2 and 2.2.0, official docker images. Windows 11. GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory, NVIDIA-SMI 565.77.01 Driver Version: 566.36 CUDA Version: 12.7...
# What does this PR do? The strategy is deliberately simplistic in order to account for many kinds of factors. Currently the kv-cache hit rate (on 4 replicas) bumps...
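The PR's actual heuristic is not shown in the snippet; purely as an illustration of why routing affects kv-cache hit rate, here is a hypothetical prefix-sticky router (all names and constants below are invented for the example):

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]  # hypothetical pool
PREFIX_CHARS = 256  # hash only the leading chunk so shared prefixes collide

def pick_replica(prompt: str) -> str:
    # Requests sharing a prefix land on the same replica, so that replica's
    # kv-cache already holds the common leading tokens.
    digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two prompts with the same system preamble route to the same replica.
print(pick_replica("You are a helpful assistant. Summarize this article..."))
print(pick_replica("You are a helpful assistant. Translate this sentence..."))
```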