text-generation-inference
Large Language Model Text Generation Inference
### System Info

Running Docker image version 2.4.0 with eetq quantization.

Model: microsoft/Phi-3.5-mini-instruct

```
{"model_id":"microsoft/Phi-3.5-mini-instruct","model_sha":"af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":2048,"max_total_tokens":4096,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"2.4.0","sha":"0a655a0ab5db15f08e45d8c535e263044b944190","docker_label":"sha-0a655a0"}
```

Hardware: Google Kubernetes Engine, L4 GPU

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07...
```
### System Info

System:
`Linux 4.18.0-553.22.1.el8_10.x86_64 #1 SMP Wed Sep 25 09:20:43 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux`
`Rocky Linux 8.10`

Hardware:
GPU: `NVIDIA A100-SXM4-80GB`
CPU:
Architecture: x86_64
CPU op-mode(s):...
We are running Llama 3.1 70B on 2 A100 GPUs with 80 GB of memory each. From the logs we can see that the warmup phase succeeded in finding the right `max_batch_total_tokens` and that...
### Model description

Hi, I'm interested in adding support for Falcon-Mamba 7B to TGI. Here are some links for this model:

paper: https://arxiv.org/abs/2410.05355
model: https://huggingface.co/tiiuae/falcon-mamba-7b

### Open source status

-...
### System Info

Using prefix caching = True
Using Attention = flashinfer

```
WARNING 11-10 11:16:48 ray_utils.py:46] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install...
```
### System Info

Hi all, I was installing from source and I got this error:

```
Building wheels for collected packages: vllm
  Building editable for vllm (pyproject.toml) ... error
  error: subprocess-exited-with-error...
```
### System Info

```
2024-11-06T04:38:58.950145Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.1
Commit sha: b1f9044d6cf082423a517cf9a6aa6e5ebd34e1c2
Docker label: sha-b1f9044
nvidia-smi:
Wed Nov  6 04:38:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03...
```
### Feature request

Add a new configuration parameter, `bigram_repetition_penalty`, to the Text Generation Inference module. This parameter would introduce a mechanism that penalizes repeated bigrams in generated text, similar to...
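The request above is truncated, but the mechanism it describes can be sketched. The following is a hypothetical illustration, not TGI's implementation: a token whose selection would complete a bigram already present in the generated sequence has its logit penalized, using the same divide-positive/multiply-negative convention as the standard HF-style `repetition_penalty`. The function name and signature are assumptions for illustration only.

```python
# Hypothetical sketch of a bigram repetition penalty (not TGI's actual code).
# A token completing an already-seen bigram (previous token, candidate token)
# gets its logit penalized before sampling.

def apply_bigram_repetition_penalty(logits, generated_ids, penalty):
    """Penalize logits of tokens that would repeat a seen bigram.

    logits: list[float], one score per vocabulary id
    generated_ids: list[int], token ids generated so far
    penalty: float > 1.0 strengthens the penalty; 1.0 is a no-op
    """
    if len(generated_ids) < 2 or penalty == 1.0:
        return list(logits)

    # All bigrams observed so far in the output.
    seen = set(zip(generated_ids, generated_ids[1:]))
    last = generated_ids[-1]

    out = list(logits)
    for token_id, score in enumerate(out):
        if (last, token_id) in seen:
            # Same convention as HF-style repetition penalty:
            # divide positive scores, multiply negative ones.
            out[token_id] = score / penalty if score > 0 else score * penalty
    return out
```

For example, after generating `[5, 7, 5]` the seen bigrams are `(5, 7)` and `(7, 5)`; since the last token is `5`, only a candidate `7` would repeat a bigram and be penalized.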
### System Info

We are using an EC2 instance with a T4 GPU in AWS (g4dn.2xlarge) for deploying our fine-tuned model.

### Information

- [X] Docker
- [ ] The CLI...
---

### System Info

**TGI Versions**:
- **2.4.0**: Deployment fails with the error: `ERROR text_generation_launcher: Error when initializing model`
- **2.1.1**: Deployment succeeds, but curl requests fail with the error:...