
Large Language Model Text Generation Inference

Results: 639 text-generation-inference issues

### System Info
`response_format` doesn't work with the OpenAI-compatible endpoint, please add it
### Information
- [x] Docker
- [ ] The CLI directly
### Tasks
- [x] An officially...
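A minimal sketch of the kind of request this issue describes, assuming a TGI server on localhost:8080 exposing the OpenAI-compatible Messages API; per the report, the `response_format` field may be ignored:

```bash
# Chat completion request with response_format (host/port are assumptions;
# TGI serves a single model, so the "model" field is not used for routing).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Reply with a JSON object"}],
    "response_format": {"type": "json_object"}
  }'
```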

### System Info
TGI version: ghcr.io/huggingface/text-generation-inference:3.0.1
Running on an AWS g5.12xlarge instance (4 GPUs)
Model: bigcode/starcoder2-15b-instruct-v0.1
Deployment: Docker
### Information
- [x] Docker...
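A sketch of the deployment described above; the flags are assumptions rather than copied from the issue, but `--num-shard 4` is the usual way to spread a model across the g5.12xlarge's four GPUs:

```bash
# Hypothetical launch command for the setup in this issue.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id bigcode/starcoder2-15b-instruct-v0.1 \
  --num-shard 4
```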

### Feature request
It seems that if I want to load a base model with an adapter and consume it, I'll have to use the `generate` route only, which allows...
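For context, a sketch of the adapter flow the issue refers to, assuming TGI's LoRA support where a per-request `adapter_id` is passed to the native `/generate` route (the adapter name here is a placeholder):

```bash
# Request routed to a specific LoRA adapter via the generate endpoint,
# which the issue says is the only route that supports this.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Write a haiku about GPUs",
    "parameters": {"adapter_id": "my-org/my-lora-adapter", "max_new_tokens": 64}
  }'
```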

### System Info
We're running TGI with Llama 3.1 8B Instruct, and observed some weird values when asking the LLM to generate strings containing the letter combination `'m` (e.g....
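One way to inspect what the issue describes: requesting `details: true` on `/generate` returns per-token output, which shows how `'m` is tokenized and decoded (host/port and prompt are assumptions):

```bash
# Fetch per-token details to see how the `'m` sequence is decoded.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "The contraction of \"I am\" is",
    "parameters": {"max_new_tokens": 8, "details": true}
  }'
```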

Is there a way to provide custom model inference code for TGI to run during invocation?
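Not a confirmed answer to this question, but the closest built-in mechanism is the `--trust-remote-code` flag (it also appears in the issue below), which lets TGI execute a model repo's custom modeling code from the Hub; the model id here is a placeholder:

```bash
# Hypothetical launch allowing custom modeling code shipped with the model.
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id some-org/model-with-custom-code --trust-remote-code
```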

### System Info
```
docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  ghcr.io/huggingface/text-generation-inference:2.4.1 \
  --model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
  --quantize bitsandbytes-nf4 --cuda-graphs 0
```
The above command is...

### System Info
I tried to deploy a Qwen2-VL fine-tuned model with both TGI and vLLM, and found that some results differ between the two frameworks. It seems that TGI consume...
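A sketch for reproducing the comparison described above, sending the same OpenAI-compatible request to both servers; the ports (TGI on 8080, vLLM on 8000) and model name are assumptions, and `temperature: 0` keeps the outputs comparable:

```bash
# Send an identical deterministic request to both frameworks and compare.
for port in 8080 8000; do
  curl -s http://localhost:$port/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen2-vl", "messages": [{"role": "user", "content": "Describe this image."}], "temperature": 0}'
  echo
done
```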

### System Info
`/info` output:
```json
{
  "model_id": "casperhansen/llama-3.3-70b-instruct-awq",
  "model_sha": "64d255621f40b42adaf6d1f32a47e1d4534c0f14",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_tokens": 8191,
  "max_total_tokens": 8192,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": ...
```
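The payload above comes from TGI's `/info` endpoint and can be fetched from a running server like so (host/port are assumptions):

```bash
# Dump the server's model and limit configuration.
curl -s http://localhost:8080/info
```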

### System Info
- TGI v3.0.1
- OS: GCP Container-Optimized OS
- 4x L4 GPUs (24 GB memory each)
- Model: `hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4`

As soon as I run the TGI benchmarking tool...
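The benchmarking tool referenced above ships inside the TGI container; a sketch of invoking it against a running server, where the container name is a placeholder and any flags beyond `--tokenizer-name` are assumptions:

```bash
# Run the bundled benchmark against the model served in this issue.
docker exec -it tgi-container \
  text-generation-benchmark \
  --tokenizer-name hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
```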

### System Info
Hi all, I encountered an issue when trying to run the Qwen/Qwen2-VL-72B-Instruct-AWQ model using the latest text-generation-inference Docker container (same issue with 3.0.1). The error message is...
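A sketch of the kind of launch the issue describes; the preview is truncated before the exact command and error, so the flags here (notably `--quantize awq` and the shard count) are assumptions:

```bash
# Hypothetical launch for the AWQ-quantized Qwen2-VL model in this issue.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id Qwen/Qwen2-VL-72B-Instruct-AWQ --quantize awq --num-shard 4
```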