
Large Language Model Text Generation Inference

Results: 639 text-generation-inference issues

# What does this PR do? Models tested: * [x] Gemma3 * [x] Paligemma * [x] LlavaNext * [x] Idefics2 * [ ] Idefics3 - `Device error - Failing independent...

- Fixes CPU affinity when running inference on CPU and when CPUs are externally managed, for instance with taskset, numactl, cgroups, the Kubernetes CPU manager, or NRI resource policy plugins. - Detect...
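
For context, a minimal sketch of what "externally managed CPUs" means in practice, assuming detection via `os.sched_getaffinity` (Linux only); the variable names are illustrative and this is not the PR's actual implementation:

```python
# Sketch: detect an externally constrained CPU set (taskset, cgroups,
# Kubernetes CPU manager, ...) and size thread pools from it, rather
# than from the machine's total CPU count.
import os

# sched_getaffinity(0) returns the CPUs the current process may run on,
# which reflects taskset/numactl/cgroup cpuset restrictions (Linux only).
allowed_cpus = os.sched_getaffinity(0)
total_cpus = os.cpu_count() or len(allowed_cpus)

if len(allowed_cpus) < total_cpus:
    print(f"CPUs are externally managed: {sorted(allowed_cpus)} "
          f"allowed out of {total_cpus} on the machine")

# Size worker threads from the allowed set, not the whole machine.
num_threads = len(allowed_cpus)
```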

### System Info I'm using the `ghcr.io/huggingface/text-generation-inference:3.0.1` container image. ## Issue Description Hi everyone! I'm using the `Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8` model for benchmarking with multiple concurrent requests. However, when I send 10...
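
A minimal sketch of this kind of concurrent-request benchmark against TGI's `/generate` endpoint, assuming a local deployment on port 8080; the URL, prompt, and parameter values are assumptions, not the reporter's script:

```python
# Fire N concurrent requests at a TGI /generate endpoint and report latency.
import asyncio
import httpx

TGI_URL = "http://localhost:8080/generate"  # assumed local deployment
CONCURRENCY = 10

async def one_request(client: httpx.AsyncClient, i: int) -> float:
    payload = {
        "inputs": f"Request {i}: explain KV caching in one sentence.",
        "parameters": {"max_new_tokens": 128},
    }
    resp = await client.post(TGI_URL, json=payload, timeout=120.0)
    resp.raise_for_status()
    return resp.elapsed.total_seconds()

async def main() -> None:
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(
            *(one_request(client, i) for i in range(CONCURRENCY))
        )
    print(f"max latency: {max(latencies):.2f}s, "
          f"mean: {sum(latencies) / len(latencies):.2f}s")

asyncio.run(main())
```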

### Description When using the inference client with function calling, models seem to never resolve their calls. Typically, with the OpenAI pattern, the simplest function/tool call is...
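
For reference, a minimal sketch of the OpenAI-pattern tool call in question, pointed at TGI's OpenAI-compatible Messages API; the endpoint URL, placeholder model name, and the `get_weather` tool are illustrative assumptions, not the reporter's setup:

```python
from openai import OpenAI

# TGI exposes an OpenAI-compatible /v1/chat/completions route.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tgi",  # placeholder; TGI serves whichever model it was launched with
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# In the OpenAI pattern the model answers with a tool_calls entry, the
# client runs the function, and replies with a "tool" role message; the
# reported issue is that this resolution step never completes.
print(response.choices[0].message.tool_calls)
```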

### Feature request Add support for a priority-based queue in the TGI backend: a multi-level priority queue with an arrival-time tie-breaker for request scheduling in the TGI v3 backend. This...
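
A sketch of the requested scheduling structure, assuming nothing about TGI v3's internal queue types: a heap ordered by (priority, arrival sequence), so equal-priority requests are served in arrival order. A monotonic counter stands in for the arrival timestamp to avoid ties:

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any

@dataclass(order=True)
class QueuedRequest:
    priority: int                    # lower value = served first
    arrival: int                     # monotonic arrival-order tie-breaker
    request: Any = field(compare=False)

class PriorityRequestQueue:
    def __init__(self) -> None:
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def push(self, request: Any, priority: int = 1) -> None:
        heapq.heappush(
            self._heap,
            QueuedRequest(priority, next(self._counter), request),
        )

    def pop(self) -> Any:
        return heapq.heappop(self._heap).request

queue = PriorityRequestQueue()
queue.push("batch job", priority=2)
queue.push("interactive chat", priority=0)
queue.push("another chat", priority=0)
assert queue.pop() == "interactive chat"  # highest priority, earliest arrival
```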

### System Info Using an Inference Endpoint here: https://endpoints.huggingface.co/m-ric/endpoints/qwen2-72b-instruct-psj ghcr.io/huggingface/text-generation-inference:3.0.1 ### Information - [ ] Docker - [x] The CLI directly ### Tasks - [x] An officially supported command - [...

I have deployed the google/gemma-3-27b-it model on 4 H100 GPUs, but it only supports a 23k context length. When I increased it to the 128k context window the model supports, I end up with...

### Model description I have a problem running Gemma 3 12B-it on my server. I have 2 GPUs (Quadro RTX 8000). When I want to run the model in...

### System Info Trying to run tgi-neuron (or neuronx-tgi) on an inf2.xlarge instance on AWS, with the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04). ### Information - [x] Docker...

### System Info SageMaker real-time inference endpoints TGI Version 2.4.1 p4d: 4 A100, 96 CPU, 1152 GB mem MAX_INPUT_LENGTH: '16128' MAX_TOTAL_TOKENS: '16384' ### Information - [x] Docker - [ ]...