
Large Language Model Text Generation Inference

Results: 639 text-generation-inference issues, sorted by recently updated

### Feature request Support longer context, up to 8k tokens; the discussion and notebook below show promising results ### Motivation Discussion: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/ Colab Notebook: https://colab.research.google.com/drive/1VI2nhlyKvd5cw4-zHvAIk00cAVj2lCCC#scrollTo=d2ceb547 ### Your contribution As it's only...
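The NTK-aware trick the linked thread describes boils down to rescaling the RoPE frequency base so low frequencies get interpolated while high frequencies stay mostly intact. A minimal sketch, assuming the `base * alpha^(dim/(dim-2))` scaling formula circulated in that thread (function names here are illustrative, not TGI's API):

```python
import math

def ntk_scaled_base(base: float, alpha: float, dim: int) -> float:
    # NTK-aware scaling: stretch the RoPE base by alpha^(dim/(dim-2)).
    # alpha=1.0 leaves the base unchanged (vanilla RoPE).
    return base * alpha ** (dim / (dim - 2))

def rope_inv_freq(dim: int, base: float = 10000.0, alpha: float = 1.0) -> list[float]:
    # Standard RoPE inverse frequencies, computed from the (possibly scaled) base.
    scaled = ntk_scaled_base(base, alpha, dim)
    return [scaled ** (-2 * i / dim) for i in range(dim // 2)]

# Example: alpha=4 is the kind of setting used to stretch a 2k-token
# LLaMA context toward 8k in the linked notebook.
freqs = rope_inv_freq(128, alpha=4.0)
```

The highest frequency (`freqs[0]`) stays at 1.0 regardless of `alpha`, which is why this scheme degrades short-context quality less than plain position interpolation.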

# What does this PR do? This adds a non-flash version of MPT. Flash is harder because we need to create a bias-ready CUDA kernel for flash attention....

# What does this PR do? Adds a new flag, propagated everywhere. Disjoint from `--quantize`, which also changes the actual dtype of the layers. Fixes #490 Fixes # (issue) ## Before...

I was wondering how GPU memory requirements vary with model size, request batch size, and max tokens. While doing some experiments where I needed the server to keep running...
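Beyond the weights themselves, the KV cache usually dominates the variable part of GPU memory, and it scales linearly in batch size and max tokens. A back-of-the-envelope estimator (the shapes below are a LLaMA-7B-like assumption in fp16, not a TGI formula):

```python
def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   max_tokens: int, batch: int, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, one entry per layer, per head, per token,
    # per sequence in the batch.
    return 2 * n_layers * n_heads * head_dim * max_tokens * batch * dtype_bytes

# Assumed LLaMA-7B-like shape: 32 layers, 32 heads, head_dim 128.
# 8 concurrent sequences at 2048 tokens each:
gib = kv_cache_bytes(32, 32, 128, 2048, 8) / 2**30  # -> 8.0 GiB
```

Doubling either the batch size or the token budget doubles this figure, which matches the kind of growth observed when the server keeps long-running requests alive.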

### Feature request [Guidance](https://github.com/microsoft/guidance) can control the generated format, which would be a nice built-in feature - Add an extra parameter to the `/generate` and `/generate_stream` protocols to specify...
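The request amounts to one extra field in the existing JSON body. A sketch of what such a payload could look like, where `guidance_schema` is a hypothetical parameter name proposed for illustration (the `inputs`/`parameters` envelope is TGI's existing request shape):

```python
import json

payload = {
    "inputs": "Name three fruits:",
    "parameters": {
        "max_new_tokens": 64,
        # Hypothetical new field carrying the format constraint;
        # here a JSON Schema the generation should conform to.
        "guidance_schema": '{"type": "array", "items": {"type": "string"}}',
    },
}

# Body as it would be POSTed to /generate or /generate_stream.
body = json.dumps(payload)
```

Keeping the constraint inside `parameters` would leave existing clients untouched, since unknown fields there can simply be ignored by servers that do not implement the feature.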


### System Info Target: x86_64-unknown-linux-gnu Cargo version: 1.69.0 Commit sha: N/A Docker label: N/A nvidia-smi (Wed Jun 28 20:17:18 2023): NVIDIA-SMI 525.105.17, Driver Version 525.105.17, CUDA Version 12.0...

Currently it executes `gen-server` and `install-transformers` first, and only then upgrades pip. It should upgrade pip first. # What does this PR do? Makes sure pip is updated ##...

### Feature request I'm running TGI on Runpod and am trying to load a model from a private Hugging Face repository. Despite passing a value for HUGGINGFACE_HUB_TOKEN in Runpod's Environment...
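One thing worth checking: `huggingface_hub` has historically honored `HUGGING_FACE_HUB_TOKEN` (underscores between every word), so the variable name itself may be why the token is not picked up. A small sketch of env-based token resolution, where the exact fallback order is an assumption for illustration, not TGI's actual lookup:

```python
import os
from typing import Optional

def resolve_hf_token() -> Optional[str]:
    # Try the spelling huggingface_hub traditionally reads first,
    # then the variant commonly passed in container environments.
    for var in ("HUGGING_FACE_HUB_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        token = os.environ.get(var)
        if token:
            return token
    return None
```

Setting both spellings when launching the container is a cheap way to rule this out regardless of which one the running version reads.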

### System Info docker image: ghcr.io/huggingface/text-generation-inference:0.8 ### Information - [X] Docker - [ ] The CLI directly ### Tasks - [X] An officially supported command - [ ] My own...