text-generation-inference
                        Large Language Model Text Generation Inference
### System Info Version: ghcr.io/huggingface/text-generation-inference:latest OS: Ubuntu 22.04 LTS GPU: 1 x A100 80GB GPU on Azure ### Information - [X] Docker - [ ] The CLI directly ###...
Hello, after the new 0.9 update, there seems to be a new "Warmup Model" step at startup. This is causing an issue where the model...
This PR adds to TGI the mixed-precision int4/fp16 kernels from the excellent [exllama repo](https://github.com/turboderp/exllama), which, according to [my benchmark](https://github.com/fxmarty/q4f16-gemm-gemv-benchmark), perform much better than the implementations available in autogptq & gptq-for-llama....
### Feature request [Stay on topic with Classifier-Free Guidance](https://arxiv.org/abs/2306.17806) CFG brings non-trivial improvements on many standard benchmarks. ### Motivation The response quality of LLMs using CFG averaged similarly to...
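For context, the core of CFG for text generation is a simple logit combination: run the model once with the conditioning prompt and once without it, then push the conditional logits away from the unconditional ones by a guidance scale. This is a minimal NumPy sketch of that formula (the function name `cfg_logits` is illustrative, not part of TGI's API):

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Classifier-Free Guidance logit mixing.

    Moves the conditional distribution away from the unconditional one:
        out = uncond + scale * (cond - uncond)
    A scale of 1.0 reproduces plain conditional sampling; larger scales
    amplify the effect of the conditioning prompt.
    """
    cond = np.asarray(cond_logits, dtype=float)
    uncond = np.asarray(uncond_logits, dtype=float)
    return uncond + guidance_scale * (cond - uncond)

# Toy vocabulary of 3 tokens: the conditional pass prefers token 0.
cond = np.array([2.0, 1.0, 0.0])
uncond = np.array([1.0, 1.0, 1.0])
print(cfg_logits(cond, uncond, 1.5))
```

With scale > 1, tokens favored by the conditioning prompt get boosted further, which is why CFG tends to keep generations "on topic".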
### System Info text-generation-inference: 0.9.0 Target: x86_64-unknown-linux-gnu Cargo version: 1.70.0 Commit sha: e28a809004620c3f3a1cc28d4bbc0b4775b1328f Docker label: sha-e28a809 nvidia-smi: ```bash +-----------------------------------------------------------------------------+ | NVIDIA-SMI 450.216.04 Driver Version: 450.216.04 CUDA Version: 11.8 | |-------------------------------+----------------------+----------------------+...
### Feature request I would like to raise a feature request for quantisation of MPT-30b models. ### Motivation MPT-30b models with a larger number of tokens take up huge space in...
### Feature request Add a `--hostname` argument to the [entrypoint of the router](https://github.com/philhchen/text-generation-inference/blob/31e2253ae721ea80032283b9e85ffe51945e5a55/router/src/main.rs#L24). ### Motivation For dual-stack k8s clusters that use IPv6 addressing, the `text-generation-inference` Docker image is insufficient because...
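To illustrate what such a `--hostname` flag would control (this is a standalone Python sketch, not the router's actual Rust code): binding to `::` with `IPV6_V6ONLY` disabled yields a dual-stack socket that accepts both IPv6 and IPv4-mapped connections, whereas a default bind to `0.0.0.0` is IPv4-only and unreachable over IPv6 in a dual-stack cluster.

```python
import socket

# Dual-stack listener: AF_INET6 bound to the IPv6 wildcard address.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
try:
    # Accept IPv4-mapped addresses too; some platforms (e.g. OpenBSD)
    # forbid dual-stack sockets and reject this option.
    srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
except OSError:
    pass
srv.bind(("::", 0))  # port 0: let the OS pick a free port
host, port = srv.getsockname()[:2]
print(host, port)
srv.close()
```

A configurable bind address would let operators choose `::` (dual-stack), `0.0.0.0` (IPv4-only), or a specific interface, which is the gap this feature request is about.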
After running:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id google/flan-t5-small --num-shard 1
```

I receive:

> RuntimeError: weight encoder.embed_tokens.weight does not exist

I tried multiple...
### System Info Ubuntu 20.04, 4 NVIDIA A10 GPUs. I think checkpoints saved after this feature was merged don't work with text-generation-inference: https://github.com/huggingface/transformers/issues/23868 With Falcon models, getting "`lm_head` not found"...
@Narsil @drbh this will update flash attention v2 and vllm; you will need to re-install them.