
Large Language Model Text Generation Inference

Results: 639 text-generation-inference issues

# What does this PR do? Models tested: * [x] Gemma3 * [x] Paligemma * [x] LlavaNext * [x] Idefics2 * [ ] Idefics3 - `Device error - Failing independent...

- Fixes CPU affinity when running inference on CPU and when CPUs are externally managed, for instance with taskset, numactl, cgroups, the Kubernetes CPU manager, or NRI resource policy plugins. - Detect...
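
For context, a minimal sketch of what "externally managed CPUs" means in practice, assuming detection via `os.sched_getaffinity` (Linux only); the variable names are illustrative and this is not the PR's actual implementation:

```python
# Sketch: detect an externally constrained CPU set (taskset, cgroups,
# Kubernetes CPU manager, ...) and size thread pools from it, rather
# than from the machine's total CPU count.
import os

# sched_getaffinity(0) returns the CPUs the current process may run on,
# which reflects taskset/numactl/cgroup cpuset restrictions (Linux only).
allowed_cpus = os.sched_getaffinity(0)
total_cpus = os.cpu_count() or len(allowed_cpus)

if len(allowed_cpus) < total_cpus:
    print(f"CPUs are externally managed: {sorted(allowed_cpus)} "
          f"allowed out of {total_cpus} on the machine")

# Size worker threads from the allowed set, not the whole machine.
num_threads = len(allowed_cpus)
```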

### System Info I'm using the `ghcr.io/huggingface/text-generation-inference:3.0.1` container image. ## Issue Description Hi everyone! I'm using the `Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8` model for benchmarking with multiple concurrent requests. However, when I send 10...
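
A minimal sketch of this kind of concurrent-request benchmark against TGI's `/generate` endpoint, assuming a local deployment on port 8080; the URL, prompt, and parameter values are assumptions, not the reporter's script:

```python
# Fire N concurrent requests at a TGI /generate endpoint and report latency.
import asyncio
import httpx

TGI_URL = "http://localhost:8080/generate"  # assumed local deployment
CONCURRENCY = 10

async def one_request(client: httpx.AsyncClient, i: int) -> float:
    payload = {
        "inputs": f"Request {i}: explain KV caching in one sentence.",
        "parameters": {"max_new_tokens": 128},
    }
    resp = await client.post(TGI_URL, json=payload, timeout=120.0)
    resp.raise_for_status()
    return resp.elapsed.total_seconds()

async def main() -> None:
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(
            *(one_request(client, i) for i in range(CONCURRENCY))
        )
    print(f"max latency: {max(latencies):.2f}s, "
          f"mean: {sum(latencies) / len(latencies):.2f}s")

asyncio.run(main())
```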

### Description When using the inference client with function calling, models seem to never resolve their calls. Typically, with the OpenAI pattern, the simplest function/tool call is...
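
For reference, a minimal sketch of the OpenAI-pattern tool call in question, pointed at TGI's OpenAI-compatible Messages API; the endpoint URL, placeholder model name, and the `get_weather` tool are illustrative assumptions, not the reporter's setup:

```python
from openai import OpenAI

# TGI exposes an OpenAI-compatible /v1/chat/completions route.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tgi",  # placeholder; TGI serves whichever model it was launched with
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# In the OpenAI pattern the model answers with a tool_calls entry, the
# client runs the function, and replies with a "tool" role message; the
# reported issue is that this resolution step never completes.
print(response.choices[0].message.tool_calls)
```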

### Feature request Add support for a priority-based queue in the TGI backend: a multi-level priority queue with an arrival-time tie-breaker for request scheduling in the TGI v3 backend. This...
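
A sketch of the requested scheduling structure, assuming nothing about TGI v3's internal queue types: a heap ordered by (priority, arrival sequence), so equal-priority requests are served in arrival order. A monotonic counter stands in for the arrival timestamp to avoid ties:

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any

@dataclass(order=True)
class QueuedRequest:
    priority: int                    # lower value = served first
    arrival: int                     # monotonic arrival-order tie-breaker
    request: Any = field(compare=False)

class PriorityRequestQueue:
    def __init__(self) -> None:
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def push(self, request: Any, priority: int = 1) -> None:
        heapq.heappush(
            self._heap,
            QueuedRequest(priority, next(self._counter), request),
        )

    def pop(self) -> Any:
        return heapq.heappop(self._heap).request

queue = PriorityRequestQueue()
queue.push("batch job", priority=2)
queue.push("interactive chat", priority=0)
queue.push("another chat", priority=0)
assert queue.pop() == "interactive chat"  # highest priority, earliest arrival
```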

### System Info Using an Inference Endpoint here: https://endpoints.huggingface.co/m-ric/endpoints/qwen2-72b-instruct-psj ghcr.io/huggingface/text-generation-inference:3.0.1 ### Information - [ ] Docker - [x] The CLI directly ### Tasks - [x] An officially supported command - [...

I have deployed the google/gemma-3-27b-it model on 4 H100 GPUs, but it only supports a 23k context length. When I increased it to the 128k context window the model supports, I end up with...

### Model description I have a problem running Gemma 3 12B-it on my server. I have 2 GPUs (Quadro RTX 8000). When I want to run the model in...

### System Info Trying to run tgi-neuron (or neuronx-tgi) on an inf2.xlarge instance on AWS, with the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04). ### Information - [x] Docker...

### System Info SageMaker real-time inference endpoints TGI Version 2.4.1 p4d: 4 A100, 96 CPU, 1152 GB mem MAX_INPUT_LENGTH: '16128' MAX_TOTAL_TOKENS: '16384' ### Information - [x] Docker - [ ]...