text-generation-inference
                        Large Language Model Text Generation Inference
### System Info Version: ghcr.io/huggingface/text-generation-inference:latest OS: Ubuntu 22.04 LTS GPU: 1 x A100 80GB GPU on Azure ### Information - [X] Docker - [ ] The CLI directly ###...
Hello, after the new 0.9 update, there seems to be a new "Warmup Model" step at startup. This is causing an issue where the model...
This PR adds to TGI the mixed-precision int4/fp16 kernels from the excellent [exllama repo](https://github.com/turboderp/exllama), which, according to [my benchmark](https://github.com/fxmarty/q4f16-gemm-gemv-benchmark), perform much better than the implementations available in autogptq & gptq-for-llama....
### Feature request [Stay on topic with Classifier-Free Guidance](https://arxiv.org/abs/2306.17806) CFG brings non-trivial improvements on many standard benchmarks. ### Motivation The response quality of LLMs using CFG averaged similarly to...
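For context, the core of CFG for text generation is a simple logit combination: run the model once with the conditioning prompt and once without it, then push the conditional logits away from the unconditional ones by a guidance scale. This is a minimal NumPy sketch of that formula (the function name `cfg_logits` is illustrative, not part of TGI's API):

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Classifier-Free Guidance logit mixing.

    Moves the conditional distribution away from the unconditional one:
        out = uncond + scale * (cond - uncond)
    A scale of 1.0 reproduces plain conditional sampling; larger scales
    amplify the effect of the conditioning prompt.
    """
    cond = np.asarray(cond_logits, dtype=float)
    uncond = np.asarray(uncond_logits, dtype=float)
    return uncond + guidance_scale * (cond - uncond)

# Toy vocabulary of 3 tokens: the conditional pass prefers token 0.
cond = np.array([2.0, 1.0, 0.0])
uncond = np.array([1.0, 1.0, 1.0])
print(cfg_logits(cond, uncond, 1.5))
```

With scale > 1, tokens favored by the conditioning prompt get boosted further, which is why CFG tends to keep generations "on topic".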
### System Info text-generation-inference: 0.9.0 Target: x86_64-unknown-linux-gnu Cargo version: 1.70.0 Commit sha: e28a809004620c3f3a1cc28d4bbc0b4775b1328f Docker label: sha-e28a809 nvidia-smi: ```bash +-----------------------------------------------------------------------------+ | NVIDIA-SMI 450.216.04 Driver Version: 450.216.04 CUDA Version: 11.8 | |-------------------------------+----------------------+----------------------+...
### Feature request I would like to raise a feature request for quantisation of MPT-30b models. ### Motivation MPT-30b models with a larger number of tokens take up huge space in...
### Feature request Add a `--hostname` argument to the [entrypoint of the router](https://github.com/philhchen/text-generation-inference/blob/31e2253ae721ea80032283b9e85ffe51945e5a55/router/src/main.rs#L24). ### Motivation For dual-stack k8s clusters that use IPv6 addressing, the `text-generation-inference` Docker image is insufficient because...
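To illustrate what such a `--hostname` flag would control (this is a standalone Python sketch, not the router's actual Rust code): binding to `::` with `IPV6_V6ONLY` disabled yields a dual-stack socket that accepts both IPv6 and IPv4-mapped connections, whereas a default bind to `0.0.0.0` is IPv4-only and unreachable over IPv6 in a dual-stack cluster.

```python
import socket

# Dual-stack listener: AF_INET6 bound to the IPv6 wildcard address.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
try:
    # Accept IPv4-mapped addresses too; some platforms (e.g. OpenBSD)
    # forbid dual-stack sockets and reject this option.
    srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
except OSError:
    pass
srv.bind(("::", 0))  # port 0: let the OS pick a free port
host, port = srv.getsockname()[:2]
print(host, port)
srv.close()
```

A configurable bind address would let operators choose `::` (dual-stack), `0.0.0.0` (IPv4-only), or a specific interface, which is the gap this feature request is about.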
After running:

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id google/flan-t5-small --num-shard 1
```

I receive:

> RuntimeError: weight encoder.embed_tokens.weight does not exist

I tried multiple...
### System Info Ubuntu 20.04, 4 NVIDIA A10 GPUs. I think checkpoints saved after this feature was merged don't work with text-generation-inference: https://github.com/huggingface/transformers/issues/23868 With Falcon models, getting "`lm_head` not found"...
@Narsil @drbh this will update flash attention v2 and vllm; you will need to re-install them.