
Large Language Model Text Generation Inference

Results: 639 text-generation-inference issues, sorted by most recently updated.

### System Info Trying to access the serverless inference endpoints using the OpenAI-compatible route leads to status 400. ``` Invalid URL: missing field `name` ``` ### Information - [...
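
A minimal sketch of the kind of call this issue describes, assuming the `openai` Python client pointed at an OpenAI-compatible route; the base URL below is a placeholder, not the exact endpoint from the report:

```python
# Sketch only: reproduces the OpenAI-compatible call pattern referred to above.
# The base_url is a placeholder; substitute the serverless endpoint you are using.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-serverless-endpoint>/v1/",  # placeholder URL (assumption)
    api_key="hf_xxx",  # Hugging Face token
)

response = client.chat.completions.create(
    model="tgi",  # model name expected by the OpenAI-compatible route; "tgi" is an assumed placeholder
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```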

### System Info We are running the tgi container and a fastapi app that queries the model. I will refer to them as "tgi" and "llm-api". Both docker containers are...
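
A rough sketch of the setup described here, with the "llm-api" FastAPI service forwarding prompts to the "tgi" container over TGI's `/generate` route; the container hostname, port, and route names are assumptions:

```python
# Sketch of an "llm-api" FastAPI service that forwards prompts to a "tgi" container.
# The hostname "tgi" and port 80 assume both containers share a Docker network.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
TGI_URL = "http://tgi:80/generate"  # assumed service name and port

class Prompt(BaseModel):
    inputs: str
    max_new_tokens: int = 128

@app.post("/complete")
async def complete(prompt: Prompt):
    # Forward the prompt to TGI and return its JSON response unchanged.
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            TGI_URL,
            json={
                "inputs": prompt.inputs,
                "parameters": {"max_new_tokens": prompt.max_new_tokens},
            },
        )
        resp.raise_for_status()
        return resp.json()
```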

### System Info #### versions: - text-generation-inference: latest docker image - os: Debian GNU/Linux 11 - model: llava-hf/llava-v1.6-mistral-7b-hf ### Information - [x] Docker - [ ] The CLI directly ###...

In other inference APIs, `response_format={"type": "json_object"}` restricts the model output to be a valid JSON object without enforcing a schema. Right now this is not supported: ``` Failed to deserialize...
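
For reference, this is the request shape the feature asks for, sketched with the `openai` client against a local TGI instance (port and model name are assumptions); per the report, TGI currently rejects it with a deserialization error:

```python
# Sketch: schema-free JSON mode as exposed by other inference APIs.
# According to the issue, TGI currently fails to deserialize this response_format.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1/", api_key="-")  # assumed local TGI

completion = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "List three fruits as JSON."}],
    response_format={"type": "json_object"},  # no schema, just "must be valid JSON"
)
print(completion.choices[0].message.content)
```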

### System Info Tested with text-generation-inference 2.4.0 and 3.0.0 Docker containers running the CLI from within on Sagemaker Real-time Inference (NVIDIA driver 535.216.01) ### Information - [x] Docker - [x]...

### Feature request TGI should read the config.json and apply the rope scaling and factor given in its parameters. ### Motivation Many inference engines auto-apply the rope scaling and rope...
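
As a sketch of what "read the config.json" means in practice, the rope-scaling fields can be inspected with `transformers`; the exact keys vary by model family, and the Llama-style `rope_scaling` layout shown here is an assumption:

```python
# Sketch: inspect the rope scaling settings the feature asks TGI to auto-apply.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # example model id
rope = getattr(config, "rope_scaling", None)
if rope:
    # Llama 3.1-style configs expose e.g. {"rope_type": "llama3", "factor": 8.0, ...}
    print("rope scaling:", rope)
else:
    print("no rope_scaling entry in config.json")
```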

### Model description **[MiniCPM-o-2_6](https://github.com/OpenBMB/MiniCPM-o)** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B...

### Feature request Do you plan on integrating dynamic serving of LoRA modules, so that new modules can be added / removed during runtime instead of having to restart the...
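
For context, a sketch of how adapters are selected today, assuming TGI's multi-LoRA support where adapters are listed at launch (e.g. via a `LORA_ADAPTERS` environment variable) and chosen per request with an `adapter_id` parameter; the request above asks for that startup list to become mutable at runtime:

```python
# Sketch: per-request adapter selection against a TGI server whose adapters were
# fixed at startup (assumed LORA_ADAPTERS=... when launching the container).
import requests

resp = requests.post(
    "http://localhost:8080/generate",  # assumed local TGI endpoint
    json={
        "inputs": "Summarize this ticket: ...",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "predibase/customer_support",  # must already be loaded at startup
        },
    },
    timeout=60,
)
print(resp.json())
```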

### Feature request Support the use of XGrammar instead of Outlines as the backend for structured-output generation. ### Motivation XGrammar has been shown to be much faster than Outlines for generation...
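
For reference, a sketch of the structured-output request TGI serves today through its grammar parameter (currently backed by Outlines); the field layout follows TGI's guidance documentation but should be treated as an assumption:

```python
# Sketch: JSON-schema-constrained generation via TGI's grammar parameter,
# which this feature proposes backing with XGrammar instead of Outlines.
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

resp = requests.post(
    "http://localhost:8080/generate",  # assumed local TGI endpoint
    json={
        "inputs": "Extract the person: Alice is 31 years old.",
        "parameters": {
            "max_new_tokens": 64,
            "grammar": {"type": "json", "value": schema},
        },
    },
    timeout=60,
)
print(resp.json())
```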