text-generation-inference
Large Language Model Text Generation Inference
This PR makes tool calling aware of the name of the function selected. Fixes: https://github.com/huggingface/text-generation-inference/issues/1657 Thank you @puppetm4st3r for the helpful snippets; large parts of this PR are simply refactors...
# What does this PR do? - Changed all models to extract `embed_tokens` in order to enable llava to separately call the embeddings and the core model layers. - Added...
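The `embed_tokens` change described above can be illustrated with a minimal sketch (the class and tensor shapes here are illustrative stand-ins, not TGI's actual model code): exposing the embedding layer lets a multimodal wrapper compute text embeddings, splice in projected image features, and then run the core layers on `inputs_embeds` rather than `input_ids`.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy causal-LM skeleton with an exposed `embed_tokens` layer."""

    def __init__(self, vocab: int = 100, dim: int = 16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])

    def forward(self, input_ids=None, inputs_embeds=None):
        # Accept either token ids or pre-computed embeddings.
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        h = inputs_embeds
        for layer in self.layers:
            h = layer(h)
        return h

model = TinyLM()
ids = torch.tensor([[1, 2, 3]])
text_emb = model.embed_tokens(ids)        # step 1: embeddings only
image_emb = torch.zeros(1, 4, 16)         # stand-in for projected image features
merged = torch.cat([image_emb, text_emb], dim=1)
out = model(inputs_embeds=merged)         # step 2: core layers on merged embeds
```

This is the pattern that lets a LLaVA-style model interleave image and text embeddings before the transformer stack runs.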
Wrap `text-generation-launcher` in the Docker image and mask `ldconfig` failures from the user (not needed in most cases anyway)
WIP: This PR explores the performance differences from using `torch.compile` on select ops and starts work on reproducible benchmarks
### Feature request Hello, thank you for all the work! With the new NVIDIA partnership supplying H100 GPUs, could you please implement FP8 TransformerEngine speedup? ### Motivation That would mean...
This PR allows the `CompletionRequest.prompt` to be sent as a string or array of strings. When an array is sent the first value will be used if it's a string;...
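The accept-string-or-array behavior described in this PR can be sketched as a small normalization helper (a minimal illustration of the stated semantics, not TGI's actual router code, which is written in Rust):

```python
from typing import List, Union

def resolve_prompt(prompt: Union[str, List[str]]) -> str:
    """Accept `prompt` as a string or an array of strings.

    When an array is sent, the first element is used if it is a
    string; otherwise the request is rejected.
    """
    if isinstance(prompt, str):
        return prompt
    if isinstance(prompt, list) and prompt and isinstance(prompt[0], str):
        return prompt[0]
    raise ValueError("prompt must be a string or a non-empty array of strings")
```

This mirrors the OpenAI-style `CompletionRequest` shape, where clients may send either form.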
### System Info
```
text-generation-launcher --env
2024-04-01T20:49:45.871764Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: e6bb3ff81fd670ad2f54904676f8165367dd47f8
Docker label: sha-e6bb3ff
```
### Information - [X] Docker - [...
### Feature request Add support for the exl2 quantization format via the argument `--quantization exl2`, which will allow loading exllamav2-quantized models with various quantization schemes (not GPTQ). ### Motivation There...
### Feature request On the `/tokenize` endpoint of TGI, add an option to apply the chat template from the model's tokenizer, if one exists, before tokenizing. ### Motivation The `/tokenize` endpoint...
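The requested flow can be sketched as follows. The template and tokenizer below are deliberately trivial stand-ins: a real implementation would call `tokenizer.apply_chat_template(...)` from `transformers` and the model's own tokenizer, but this shows the shape of the option's semantics.

```python
from typing import Dict, List

def render_chat(messages: List[Dict[str, str]]) -> str:
    # Stand-in for tokenizer.apply_chat_template(messages, tokenize=False).
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)

def tokenize(text: str) -> List[str]:
    # Whitespace split stands in for the real model tokenizer.
    return text.split()

def tokenize_request(messages: List[Dict[str, str]],
                     apply_chat_template: bool) -> List[str]:
    # With the proposed option set, the chat template is rendered first;
    # otherwise the raw message content is tokenized as before.
    if apply_chat_template:
        text = render_chat(messages)
    else:
        text = " ".join(m["content"] for m in messages)
    return tokenize(text)
```

The point of the option is that the token count returned by `/tokenize` then matches what the model actually sees after templating.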
### System Info Latest Docker version ### Information - [X] Docker - [ ] The CLI directly ### Tasks - [ ] An officially supported command - [ ] My own modifications...