text-generation-inference
                        Large Language Model Text Generation Inference
This PR is a work in progress that explores adding support for video inputs with Qwen2-VL. Thank you @mfarre for getting this effort started.

TODOS
- [X] support `video_url`s
- ...
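As a rough sketch of how a client might exercise this once merged, the following assumes the PR mirrors the OpenAI-style `image_url` content parts that TGI's `/v1/chat/completions` endpoint already accepts; the exact `video_url` schema is an assumption, not confirmed by this PR excerpt:

```python
import requests

# Hypothetical request shape: assumes the new "video_url" content part works
# like TGI's existing OpenAI-style "image_url" part. Endpoint and model name
# are placeholders for a local TGI instance serving Qwen2-VL.
payload = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        }
    ],
    "max_tokens": 256,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```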
When deploying a Llama-3 model (e.g. nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) on TGI version 2.4.1, I have observed that the model always defaults to a maximum token size of 4096 tokens if the `max...
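One way to check which limit the server actually applied is to query the running instance's `/info` endpoint; a minimal sketch, assuming the field names used in recent TGI releases (older versions report `max_input_length`):

```python
import requests

# Inspect the effective token limits of a running TGI server. The field
# names below follow recent TGI releases; older versions use
# "max_input_length" instead of "max_input_tokens".
info = requests.get("http://localhost:8080/info", timeout=10).json()
print("model_id:        ", info.get("model_id"))
print("max_input_tokens:", info.get("max_input_tokens", info.get("max_input_length")))
print("max_total_tokens:", info.get("max_total_tokens"))
```

If this reports 4096 despite the model supporting a larger context, explicitly passing `--max-total-tokens` (and `--max-input-tokens`) to the launcher is the usual workaround.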
### System Info
1. Docker image: `763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.0-gpu-py310-cu121-ubuntu20.04`
2. Deployment: SageMaker endpoint, using the `HuggingFaceModel` object.

### Information
- [X] Docker
- [ ] The CLI directly

### Tasks
- [...
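For context, a deployment along the lines described above typically looks like this sketch; the role ARN, model ID, and instance type are placeholders rather than values from the report:

```python
from sagemaker.huggingface import HuggingFaceModel

# Sketch of a TGI deployment via the SageMaker HuggingFaceModel object.
# Role ARN, model ID, and instance type below are placeholders.
model = HuggingFaceModel(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.0-gpu-py310-cu121-ubuntu20.04",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
        "SM_NUM_GPUS": "1",
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
print(predictor.endpoint_name)
```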
I encountered the same issue while using `baichuan2-13B-chat`. I extracted the chat parameters from baichuan2's [generation_config.json](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/generation_config.json), and when I call the TGI interface, the result is as follows. When...
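For reference, forwarding such parameters to TGI's `/generate` endpoint looks like the sketch below; the values are illustrative stand-ins for the ones in the linked generation_config.json, and Baichuan2's chat prompt formatting is omitted:

```python
import requests

# Forward sampling parameters (as found in a model's generation_config.json)
# to TGI's /generate endpoint. Values here are illustrative placeholders;
# take the real ones from the linked config. Chat-template formatting of
# the prompt is model-specific and omitted.
payload = {
    "inputs": "Hello, who are you?",
    "parameters": {
        "do_sample": True,
        "temperature": 0.3,
        "top_k": 5,
        "top_p": 0.85,
        "repetition_penalty": 1.05,
        "max_new_tokens": 256,
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```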
### System Info
Version: `text-generation-launcher 2.4.0`

Environment:
```
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: 0a655a0ab5db15f08e45d8c535e263044b944190
Docker label: sha-0a655a0
```

Hardware: 4 x A100
```
NVIDIA-SMI 550.78 Driver... (nvidia-smi table truncated)
```
# What does this PR do?
Fixes #2376

I made a simple middleware to extract OpenTelemetry context (e.g. trace id, span id) from request headers. When valid traceparent info is...
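The PR targets TGI's Rust router, but the technique is language-agnostic; here is a minimal Python sketch of the same idea using opentelemetry-api (names are illustrative, not the PR's actual code):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("tgi.request")

def handle_request(headers: dict) -> None:
    # extract() reads the W3C `traceparent` (and `tracestate`/`baggage`)
    # headers; if they are missing or invalid it returns an empty context.
    ctx = extract(headers)
    # Starting the server-side span with the extracted context makes it a
    # child of the caller's span, so the request joins the caller's trace.
    with tracer.start_as_current_span("generate", context=ctx):
        ...  # handle the request body here
```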
### System Info
Running official docker image: ghcr.io/huggingface/text-generation-inference:2.4.0

os: Linux 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi: NVIDIA-SMI 550.90.07 Driver Version:... (table truncated)
### System Info
We are deploying the model meta-llama/Meta-Llama-3.1-70B-Instruct with FP8 quantization, and everything works perfectly for hours until the server crashes with this error:

2024-10-01T07:43:22.055987Z ERROR batch{batch_size=1}:prefill:prefill{id=290 size=1}:prefill{id=290 size=1}:...
### System Info
When testing TGI Docker on 2 x A40 GPUs to load Llama3.1-70b with `eetq` quantization, I ran into a `CUDA illegal memory error`.

### Information
- [X] Docker
- ...
### System Info
TGI version: latest; single NVIDIA GeForce RTX 3090

### Information
- [X] Docker
- [ ] The CLI directly

### Tasks
- [X] An officially supported command
- ...