
Large Language Model Text Generation Inference

Results: 639 text-generation-inference issues

### Feature request It would be great if we could specify the model's datatype as a command-line argument. ### Motivation For most PyTorch models, `torch_dtype` is currently set to `torch.float16`...
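A minimal sketch of the idea, assuming a hypothetical `--dtype` flag wired through to `torch_dtype` at load time (the flag name and model id here are illustrative):

```
import argparse

import torch
from transformers import AutoModelForCausalLM

# Hypothetical --dtype flag mapped onto torch_dtype when loading the model.
parser = argparse.ArgumentParser()
parser.add_argument("--dtype", choices=["float16", "bfloat16", "float32"],
                    default="float16")
args = parser.parse_args()

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",  # placeholder model id
    torch_dtype=getattr(torch, args.dtype),
)
```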

### System Info Versions: python 3.10, sagemaker 2.168.0 (latest), huggingface tgi 0.8.2 (latest) ### Reproduction I'm trying to deploy MPT-30B-instruct and WizardLM-Uncensored-Falcon-40b in SageMaker; my config is `config = ...`
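For context, a typical TGI-on-SageMaker deployment looks roughly like the sketch below; the model id, instance type, and environment values are illustrative, not the reporter's exact config:

```
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

# Illustrative deployment: HF_MODEL_ID selects the model, SM_NUM_GPUS the shards.
model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "mosaicml/mpt-30b-instruct",
        "SM_NUM_GPUS": "4",
    },
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```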

### System Info

```
model=gpt2
volume=$HOME/.cache/huggingface/hub
num_shard=1
docker run --gpus all --shm-size 1g -p 8081:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:0.8.1 \
    --model-id $model --num-shard $num_shard
```

### Information - [X] Docker - [ ] The...

Stale

### Feature request Because `/generate_stream` delivers token streaming over Server-Sent Events (SSE), the client has no way to tell the server to stop streaming. ### Motivation Sometimes, when requesting very large...
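Today the only workaround is client-side: closing the HTTP connection. A minimal sketch, with illustrative URL and payload:

```
import requests

resp = requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "Write a long story", "parameters": {"max_new_tokens": 1000}},
    stream=True,
)
for i, line in enumerate(resp.iter_lines()):
    if line:
        print(line.decode())
    if i > 10:          # the client decides it has seen enough
        resp.close()    # tear down the connection to stop the stream
        break
```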

The Falcon 40B model occupies nearly 90 GB of disk space, takes a long time to download, and gives no indication of download progress. Is there any other way to download the...
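One option is to pre-fetch the weights with `huggingface_hub`, which shows per-file progress bars and resumes interrupted downloads; the target path below is illustrative:

```
from huggingface_hub import snapshot_download

# Pre-download the repository so the server can start from a local copy;
# progress is shown per file and interrupted downloads can resume.
snapshot_download(repo_id="tiiuae/falcon-40b", local_dir="/data/falcon-40b")
```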

### Feature request Do you have any suggestions on how to make the server more compatible with the OpenAI API? ### Motivation Projects like https://github.com/go-skynet/LocalAI or https://github.com/lm-sys/FastChat ### Your contribution discuss
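To make the idea concrete, a hypothetical adapter could translate an OpenAI-style completion request into a TGI `/generate` call; the TGI request and response fields are real, but the adapter itself is only a sketch:

```
import requests

def openai_completion(prompt, max_tokens=64, temperature=1.0):
    # Translate OpenAI-style arguments into TGI's /generate payload.
    r = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
            },
        },
    )
    text = r.json()["generated_text"]
    return {"choices": [{"text": text}]}  # OpenAI-shaped response
```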

## Feature request The feature would be to support accelerated inference with the CTranslate2 framework: https://github.com/OpenNMT/CTranslate2 ## Motivation Reasons to use CTranslate2: #### faster float16 generation In my case, it outperforms vLLM...
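For reference, float16 generation with CTranslate2 looks roughly like this, assuming the model has already been converted with `ct2-transformers-converter` (paths and model id are illustrative):

```
import ctranslate2
from transformers import AutoTokenizer

generator = ctranslate2.Generator("ct2_model_dir", device="cuda",
                                  compute_type="float16")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model id

# CTranslate2 consumes token strings rather than raw text.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, world"))
results = generator.generate_batch([tokens], max_length=64, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
```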

Stale

# What does this PR do? This PR automatically points tensors that were removed due to deduplication back to their still-existing twin. In `server.text_generation_server.utils.convert.py#convert_file`, tensors that have a value equal...
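A sketch of the underlying idea, with illustrative names: tensors backed by the same memory are saved once, and each removed name is recorded as an alias of its kept twin so it can be pointed back at load time:

```
from collections import defaultdict

import torch

def find_shared_tensors(state_dict):
    # Group tensors by their underlying memory address (a simplification:
    # data_ptr() identifies tensors that start at the same storage offset).
    by_storage = defaultdict(list)
    for name, tensor in state_dict.items():
        by_storage[tensor.data_ptr()].append(name)
    # Keep one tensor per group; map every removed name to its twin.
    aliases = {}
    for names in by_storage.values():
        kept, *removed = sorted(names)
        for name in removed:
            aliases[name] = kept
    return aliases
```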

### System Info Hi, I'm using the latest version of text-generation-inference (image sha-ae466a8) on RunPod via Docker. When I try to load a GPTQ file from local disk with QUANTIZE...

### Feature request Came across this article `https://kaiokendev.github.io/til#extending-context-to-8k` which suggests that, by interpolating position indices, we can potentially extend the model's context length.

```
# These two lines:
self.scale = 1 / ...
```
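A self-contained sketch of that position-interpolation trick inside a rotary-embedding cache (helper name and defaults are illustrative): positions beyond the trained window are compressed back into it by a constant scale factor.

```
import torch

def build_rope_cache(seq_len, dim, base=10000.0, max_trained_len=2048):
    # Compress out-of-range positions back into the trained window.
    scale = max_trained_len / seq_len if seq_len > max_trained_len else 1.0
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float() * scale  # interpolated positions
    freqs = torch.outer(t, inv_freq)
    return freqs.cos(), freqs.sin()  # rotation tables for queries/keys
```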