text-generation-inference
Large Language Model Text Generation Inference
### Feature request During testing for [https://github.com/BerriAI/litellm/pull/7747](https://github.com/BerriAI/litellm/pull/7747) , I found that the differences between OpenAI and TGI are more fundamental than just the optionality of the provided schema. The `response_format`...
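The preview above is truncated, but the shape of the mismatch it describes can be sketched. Below is a minimal, illustrative comparison of the two `response_format` payload shapes; the exact field names vary across OpenAI and TGI versions, so treat these as assumptions rather than authoritative API definitions:

```python
# Illustrative request-body shapes only (assumptions, not authoritative):
# OpenAI nests the schema under "json_schema"; TGI-style grammar constraints
# have historically passed the schema directly as "value".

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

# OpenAI-style structured output: schema is optional and nested.
openai_response_format = {
    "type": "json_schema",
    "json_schema": {"name": "reply", "schema": schema, "strict": True},
}

# TGI-style constrained output: the schema *is* the grammar.
tgi_response_format = {
    "type": "json_object",
    "value": schema,
}
```

Because the two shapes disagree on both the `type` discriminator and where the schema lives, a translation layer such as litellm cannot map them field-for-field, which is the fundamental difference the issue points at.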
### System Info I was using v2.3.1 via Docker and everything was working. After updating to later versions, including the latest, TGI no longer starts due to an error:...
### System Info Using the 3.1.0 docker container on an AWS g6.12xlarge instance. `--env` output:
```
2025-02-19T17:51:35.116359Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.84.0
Commit sha: 463228ebfc444f60fa351da34a2ba158af0fe9d8
Docker...
```
### System Info text-generation-inference docker image version 3.1.0 with the following parameters (as logged by TGI): INFO text_generation_launcher: Args { model_id: "google/flan-t5-xxl", revision: None, validation_workers: 2, sharded: None, num_shard: Some( 1,...
### System Info The Dockerfile includes a non-commercially licensed Conda installation, whereas TGI itself is under the Apache-2.0 License, so installing TGI through Docker creates a Conda license violation. https://github.com/huggingface/text-generation-inference/blob/main/Dockerfile ### Information - [x] Docker...
### System Info Docker deployment:
```
$ nvidia-smi
Thu Feb 13 23:44:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M |...
```
### System Info The CPU affinity implementation (introduced in commit 59922f9bc16afee9efcc7ee1c5f9d753ef314ffa, first released in v2.3.0, and still present at current HEAD (4b8cda684b45b799de01a65e3fe3422a34a621d3)) ignores any pre-existing CPU pinning for the process. ### Information - [x]...
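The expectation in the report above can be sketched in a few lines: intersect any desired pinning with the mask the process already has, rather than overwriting it. This is a hypothetical illustration of the principle (Linux-only `os.sched_*` calls), not TGI's actual code:

```python
import os

# CPUs the process is ALREADY pinned to (e.g. by taskset, cgroups, or a
# job scheduler) -- the mask the report says should be respected.
allowed = os.sched_getaffinity(0)

# Hypothetical per-shard CPU set an affinity optimizer might want to use.
desired = set(range(0, 4))

# Never escape the existing mask: pin only to the intersection, and fall
# back to the existing mask if the intersection is empty.
effective = (desired & allowed) or allowed
os.sched_setaffinity(0, effective)
```

Intersecting rather than replacing means an operator's explicit pinning always remains an upper bound on where the process may run.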
### Feature request Any chance we could get support for the RTX 3090? ### Motivation I have an RTX 3090 and would like to utilize it. ### Your contribution I'm not...
### System Info text-generation-inference 3.1.0 (saw the same issue on 3.0.0):
```shell
model="NousResearch/Meta-Llama-3.1-8B-Instruct"
volume="$PWD/data"
docker create --name llama3.1-speculate2 --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id $model --quantize...
```
### System Info I'm trying to deploy Qwen/Qwen2-VL-2B-Instruct on an MI210 using ghcr.io/huggingface/text-generation-inference:3.1.0-rocm, but it fails during the warmup step with this error:
```
INFO text_generation_launcher: Using attention paged - Prefix caching...
```