text-generation-inference
                        Large Language Model Text Generation Inference
This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.

TODOS
- [X] support `video_url`s
- ...
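As a rough sketch of how a client might exercise this once merged, the following assumes the PR mirrors the OpenAI-style `image_url` content parts that TGI's `/v1/chat/completions` endpoint already accepts; the exact `video_url` schema is an assumption, not confirmed by this PR excerpt:

```python
import requests

# Hypothetical request shape: assumes the new "video_url" content part works
# like TGI's existing OpenAI-style "image_url" part. Endpoint and model name
# are placeholders for a local TGI instance serving Qwen2-VL.
payload = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        }
    ],
    "max_tokens": 256,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```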
When deploying a Llama-3 model (e.g. nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) on TGI version 2.4.1, I have observed that the model always defaults to a maximum token size of 4096 tokens if the `max...
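One way to check which limit the server actually applied is to query the running instance's `/info` endpoint; a minimal sketch, assuming the field names used in recent TGI releases (older versions report `max_input_length`):

```python
import requests

# Inspect the effective token limits of a running TGI server. The field
# names below follow recent TGI releases; older versions use
# "max_input_length" instead of "max_input_tokens".
info = requests.get("http://localhost:8080/info", timeout=10).json()
print("model_id:        ", info.get("model_id"))
print("max_input_tokens:", info.get("max_input_tokens", info.get("max_input_length")))
print("max_total_tokens:", info.get("max_total_tokens"))
```

If this reports 4096 despite the model supporting a larger context, explicitly passing `--max-total-tokens` (and `--max-input-tokens`) to the launcher is the usual workaround.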
### System Info
1. Docker image: `763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.0-gpu-py310-cu121-ubuntu20.04`
2. Deployment: SageMaker endpoint, using the `HuggingFaceModel` object.

### Information
- [X] Docker
- [ ] The CLI directly

### Tasks
- [...
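For context, a deployment along the lines described above typically looks like this sketch; the role ARN, model ID, and instance type are placeholders rather than values from the report:

```python
from sagemaker.huggingface import HuggingFaceModel

# Sketch of a TGI deployment via the SageMaker HuggingFaceModel object.
# Role ARN, model ID, and instance type below are placeholders.
model = HuggingFaceModel(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.0-gpu-py310-cu121-ubuntu20.04",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
        "SM_NUM_GPUS": "1",
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
print(predictor.endpoint_name)
```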
I encountered the same issue while using `baichuan2-13B-chat`. I extracted the chat parameters from baichuan2's [generation_config.json](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/generation_config.json), and when I call the TGI interface, the result is as follows. When...
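For reference, forwarding such parameters to TGI's `/generate` endpoint looks like the sketch below; the values are illustrative stand-ins for the ones in the linked generation_config.json, and Baichuan2's chat prompt formatting is omitted:

```python
import requests

# Forward sampling parameters (as found in a model's generation_config.json)
# to TGI's /generate endpoint. Values here are illustrative placeholders;
# take the real ones from the linked config. Chat-template formatting of
# the prompt is model-specific and omitted.
payload = {
    "inputs": "Hello, who are you?",
    "parameters": {
        "do_sample": True,
        "temperature": 0.3,
        "top_k": 5,
        "top_p": 0.85,
        "repetition_penalty": 1.05,
        "max_new_tokens": 256,
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```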
### System Info
Version: `text-generation-launcher 2.4.0`

Environment:
```
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: 0a655a0ab5db15f08e45d8c535e263044b944190
Docker label: sha-0a655a0
```

Hardware: 4 x A100
```
NVIDIA-SMI 550.78 Driver... (nvidia-smi table truncated)
```
# What does this PR do?
Fixes #2376

I made a simple middleware to extract OpenTelemetry context (e.g. trace id, span id) from request headers. When valid traceparent info is...
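The PR targets TGI's Rust router, but the technique is language-agnostic; here is a minimal Python sketch of the same idea using opentelemetry-api (names are illustrative, not the PR's actual code):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("tgi.request")

def handle_request(headers: dict) -> None:
    # extract() reads the W3C `traceparent` (and `tracestate`/`baggage`)
    # headers; if they are missing or invalid it returns an empty context.
    ctx = extract(headers)
    # Starting the server-side span with the extracted context makes it a
    # child of the caller's span, so the request joins the caller's trace.
    with tracer.start_as_current_span("generate", context=ctx):
        ...  # handle the request body here
```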
### System Info
Running official docker image: ghcr.io/huggingface/text-generation-inference:2.4.0

os: Linux 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi: NVIDIA-SMI 550.90.07 Driver Version:... (table truncated)
### System Info
We are deploying the model meta-llama/Meta-Llama-3.1-70B-Instruct with FP8 quantization, and everything works perfectly for hours until the server crashes with this error:

2024-10-01T07:43:22.055987Z ERROR batch{batch_size=1}:prefill:prefill{id=290 size=1}:prefill{id=290 size=1}:...
### System Info
When testing TGI Docker on 2 x A40 GPUs to load Llama3.1-70b with `eetq` quantization, I ran into a `CUDA illegal memory error`.

### Information
- [X] Docker
- ...
### System Info
TGI version: latest; single NVIDIA GeForce RTX 3090

### Information
- [X] Docker
- [ ] The CLI directly

### Tasks
- [X] An officially supported command
- ...