text-generation-inference issues

`HUGGING_FACE_HUB_TOKEN` not exported in Sagemaker entrypoint

2

### System Info - AWS `sagemaker` 2.163.0 - g5.12xlarge instance type with 4 NVIDIA A10G GPUs and 96GB of GPU memory ### Information - [X] Docker - [ ] The...

mspronesti

Support PagedAttention

6

### Feature request [vLLM](https://github.com/vllm-project/vllm) is fast with efficient management of attention key and value memory with PagedAttention, serving higher throughput than TGI. ### Motivation Adopting PagedAttention would increase throughput and...

Atry
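For readers unfamiliar with the idea behind this request, here is a toy sketch of the paging scheme it refers to: the key/value cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks. This is not vLLM's or TGI's implementation; `BLOCK_SIZE`, `BlockAllocator`, and `Sequence` are hypothetical names chosen for illustration, and no tensors are stored.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value for illustration)


@dataclass
class BlockAllocator:
    """Toy pool of fixed-size KV-cache blocks, identified by integer index."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        # All blocks start out free.
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks):
        # Return a finished sequence's blocks to the pool.
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Tracks how many tokens a request holds and which blocks back them."""
    length: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new block is needed only when the last block is full (or none exists).
        if self.length % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.length += 1

    def slot(self, position: int):
        """Map a logical token position to (physical block, offset in block)."""
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, rather than reserving memory for its maximum possible length up front.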

feat(server): use encoding to get prefill tokens

1

OlivierDehaene

Add integration test

# What does this PR do? Adds an integration test on llama-7b-gptq Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs...

Narsil

Option for use_fast tokenizer

7

### Feature request Add an additional option to specify the `use_fast` flag for `AutoTokenizer`. ### Motivation Some models have slightly different behavior, or buggy versions, of the slow or fast tokenizer. It...

psinger

Deploying fine tuned Falcon 7B model onto SageMaker yields download errors

6

### System Info Following the guide given on https://huggingface.co/blog/sagemaker-huggingface-llm, trying to deploy a fine tuned Falcon 7B model yields the following errors: ``` Error: DownloadError File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in...

BaiqingL

Falcon 40B Instruct not generating in stream mode.

1

I encountered an issue while using the Falcon 40B Instruct model. Here are the steps I followed: I instantiated the model using the following command: ``` docker run --gpus all...

ckanaar

Changing convert logic.

Should be more robust to shared tensors (ok when using `from_pretrained`), but it forces us to add new checks in our loading code (since the chosen key to keep might be...

Narsil

Cannot deploy Falcon-40B-instruct server because of low fixed timeout on startup

14

### System Info ## Problem Using the 0.8 (0.8.2) container with `--model-id tiiuae/falcon-40b-instruct --num-shard 2` on runpod.io with 2xA100 80GB. On startup it starts loading the 2 shards but they...

mzperix

Stale

Support for mosaicml/mpt-30b-instruct model

17

### Feature request I was wondering if there will be support for the newly released [mpt-30b-instruct](https://huggingface.co/mosaicml/mpt-30b-instruct) ### Motivation It's not possible to use the `mosaicml/mpt-30b-instruct` model: `ValueError: sharded is not...

maziyarpanahi
