
Large Language Model Text Generation Inference

Results: 639 text-generation-inference issues

### System Info
`response_format` doesn't work with the OpenAI-compatible endpoint, please add it
### Information
- [x] Docker
- [ ] The CLI directly
### Tasks
- [x] An officially...
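A minimal sketch of the kind of request this issue describes, assuming a TGI server on localhost:8080 exposing the OpenAI-compatible Messages API; per the report, the `response_format` field may be ignored:

```bash
# Chat completion request with response_format (host/port are assumptions;
# TGI serves a single model, so the "model" field is not used for routing).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Reply with a JSON object"}],
    "response_format": {"type": "json_object"}
  }'
```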

### System Info
TGI version: ghcr.io/huggingface/text-generation-inference:3.0.1
Running on an AWS g5.12xlarge instance (4 GPUs)
Model: bigcode/starcoder2-15b-instruct-v0.1
Deployment: Docker
### Information
- [x] Docker...
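A sketch of the deployment described above; the flags are assumptions rather than copied from the issue, but `--num-shard 4` is the usual way to spread a model across the g5.12xlarge's four GPUs:

```bash
# Hypothetical launch command for the setup in this issue.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id bigcode/starcoder2-15b-instruct-v0.1 \
  --num-shard 4
```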

### Feature request
It seems that if I want to load a base model with an adapter and consume it, I'll have to use the `generate` route only, which allows...
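For context, a sketch of the adapter flow the issue refers to, assuming TGI's LoRA support where a per-request `adapter_id` is passed to the native `/generate` route (the adapter name here is a placeholder):

```bash
# Request routed to a specific LoRA adapter via the generate endpoint,
# which the issue says is the only route that supports this.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Write a haiku about GPUs",
    "parameters": {"adapter_id": "my-org/my-lora-adapter", "max_new_tokens": 64}
  }'
```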

### System Info
We're running TGI with Llama 3.1 8B Instruct, and observed some weird values when asking the LLM to generate strings containing the letter combination `'m` (e.g....
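One way to inspect what the issue describes: requesting `details: true` on `/generate` returns per-token output, which shows how `'m` is tokenized and decoded (host/port and prompt are assumptions):

```bash
# Fetch per-token details to see how the `'m` sequence is decoded.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "The contraction of \"I am\" is",
    "parameters": {"max_new_tokens": 8, "details": true}
  }'
```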

Is there a way to provide custom model inference code for TGI to run during invocation?
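Not a confirmed answer to this question, but the closest built-in mechanism is the `--trust-remote-code` flag (it also appears in the issue below), which lets TGI execute a model repo's custom modeling code from the Hub; the model id here is a placeholder:

```bash
# Hypothetical launch allowing custom modeling code shipped with the model.
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id some-org/model-with-custom-code --trust-remote-code
```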

### System Info
```
docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  ghcr.io/huggingface/text-generation-inference:2.4.1 \
  --model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
  --quantize bitsandbytes-nf4 --cuda-graphs 0
```
The above command is...

### System Info
I tried to deploy a Qwen2-VL fine-tuned model with both TGI and vLLM, and found that some results differ between the two frameworks. It seems that TGI consume...
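A sketch for reproducing the comparison described above, sending the same OpenAI-compatible request to both servers; the ports (TGI on 8080, vLLM on 8000) and model name are assumptions, and `temperature: 0` keeps the outputs comparable:

```bash
# Send an identical deterministic request to both frameworks and compare.
for port in 8080 8000; do
  curl -s http://localhost:$port/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen2-vl", "messages": [{"role": "user", "content": "Describe this image."}], "temperature": 0}'
  echo
done
```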

### System Info
`/info` output:
```json
{
  "model_id": "casperhansen/llama-3.3-70b-instruct-awq",
  "model_sha": "64d255621f40b42adaf6d1f32a47e1d4534c0f14",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_tokens": 8191,
  "max_total_tokens": 8192,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": ...
```
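The payload above comes from TGI's `/info` endpoint and can be fetched from a running server like so (host/port are assumptions):

```bash
# Dump the server's model and limit configuration.
curl -s http://localhost:8080/info
```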

### System Info
- TGI v3.0.1
- OS: GCP Container-Optimized OS
- 4x L4 GPUs (24 GB memory each)
- Model: `hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4`

As soon as I run the TGI benchmarking tool...
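The benchmarking tool referenced above ships inside the TGI container; a sketch of invoking it against a running server, where the container name is a placeholder and any flags beyond `--tokenizer-name` are assumptions:

```bash
# Run the bundled benchmark against the model served in this issue.
docker exec -it tgi-container \
  text-generation-benchmark \
  --tokenizer-name hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
```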

### System Info
Hi all, I encountered an issue when trying to run the Qwen/Qwen2-VL-72B-Instruct-AWQ model using the latest text-generation-inference Docker container (same issue with 3.0.1). The error message is...
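A sketch of the kind of launch the issue describes; the preview is truncated before the exact command and error, so the flags here (notably `--quantize awq` and the shard count) are assumptions:

```bash
# Hypothetical launch for the AWQ-quantized Qwen2-VL model in this issue.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id Qwen/Qwen2-VL-72B-Instruct-AWQ --quantize awq --num-shard 4
```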