text-generation-inference

                        Large Language Model Text Generation Inference
### System Info Exact command used to run TGI: `docker run --gpus all --shm-size 1g -p 5000:80 -v /mnt/disk/models/llama-3.3-70b-instruct-awq:/usr/src/llama-3.3-70b -it ghcr.io/huggingface/text-generation-inference:3.0.1 --model-id llama-3.3-70b --quantize awq --cuda-memory-fraction 1 --sharded true --num-shard...
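For reference, a minimal client sketch against a container started roughly as above, assuming the server is reachable on localhost:5000 per the `-p 5000:80` mapping; the prompt and `max_new_tokens` value are illustrative, not from the issue:

```python
from huggingface_hub import InferenceClient

# Assumes the TGI container above is running and mapped to localhost:5000.
client = InferenceClient("http://localhost:5000")

# Illustrative request; max_new_tokens is an arbitrary example value.
output = client.text_generation(
    "What is AWQ quantization?",
    max_new_tokens=128,
)
print(output)
```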
### System Info text-generation-launcher 3.1.1-dev0 Single RTX 4070 S GPU NVIDIA-SMI 572.16 Driver Version: 572.16 CUDA Version: 12.8 Models Used: meta-llama/Llama-3.1-8B-Instruct, Yujivus/DeepSeek-R1-Distill-Llama-8B-AWQ, Yujivus/Phi-4-Health-CoT-1.1-AWQ Docker Command: docker run --name tgi-server...
### System Info Hi team, We are trying to get the default parameter values that are used when invoking a fine-tuned model deployed with TGI (latest version)....
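One way to inspect the settings a running TGI server applies is its GET `/info` endpoint; a rough sketch, where the host and port are placeholder assumptions for your deployment:

```python
import requests

# Placeholder endpoint; substitute your deployment's host and port.
info = requests.get("http://localhost:8080/info", timeout=10).json()

# /info reports the model id plus server-side limits such as
# max_input_tokens and max_total_tokens.
for key, value in info.items():
    print(f"{key}: {value}")
```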
Avoid leaking token and cache url
### Model description Mistral's recent commit uses this template ``` "chat_template": "{%- set today = strftime_now(\"%Y-%m-%d\") %}\n{%- set default_system_message = \"You are Mistral Small 3, a Large Language Model (LLM)...
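The template calls a `strftime_now` helper that plain Jinja does not provide. A minimal sketch of how a renderer can expose such a helper, assuming the `jinja2` package; the helper's exact semantics in TGI are an assumption here:

```python
from datetime import datetime
from jinja2 import Environment

def strftime_now(fmt: str) -> str:
    # Assumed behavior: format the current local time, as the template expects.
    return datetime.now().strftime(fmt)

env = Environment()
env.globals["strftime_now"] = strftime_now

template = env.from_string(
    '{%- set today = strftime_now("%Y-%m-%d") %}Today is {{ today }}.'
)
print(template.render())
```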
# What does this PR do? The TGI server fails to start due to missing Python headers during the compilation of Triton indexing kernels. The solution is to change the...
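Independently of whatever fix the PR lands, a quick diagnostic sketch for checking whether the Python development headers the Triton build needs are present in an image:

```python
import os
import sysconfig

# Triton's kernel compilation needs Python.h; this checks whether the
# interpreter's include directory actually ships it.
include_dir = sysconfig.get_paths()["include"]
header = os.path.join(include_dir, "Python.h")
print(f"{header}: {'found' if os.path.exists(header) else 'missing'}")
```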
### System Info Docker container: `ghcr.io/huggingface/text-generation-inference:3.0.0` ### Information - [x] Docker - [ ] The CLI directly ### Tasks - [x] An officially supported command - [ ] My own...
### System Info TGI versions 3.0.2 and 2.2.0, official docker images. Windows 11. GPU: NVIDIA GeForce RTX 4060 Ti, 16 GB memory, NVIDIA-SMI 565.77.01 Driver Version: 566.36 CUDA Version: 12.7...
# What does this PR do? The strategy is deliberately simplistic in order to account for many kinds of factors. Currently the kv-cache hit rate (on 4 replicas) bumps...
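The PR's actual heuristic is not shown in the snippet; purely as an illustration of why routing affects kv-cache hit rate, here is a hypothetical prefix-sticky router (all names and constants below are invented for the example):

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]  # hypothetical pool
PREFIX_CHARS = 256  # hash only the leading chunk so shared prefixes collide

def pick_replica(prompt: str) -> str:
    # Requests sharing a prefix land on the same replica, so that replica's
    # kv-cache already holds the common leading tokens.
    digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two prompts with the same system preamble route to the same replica.
print(pick_replica("You are a helpful assistant. Summarize this article..."))
print(pick_replica("You are a helpful assistant. Translate this sentence..."))
```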