text-generation-inference
Large Language Model Text Generation Inference
### Feature request It seems we now have support for loading models with 4-bit quantization starting from bitsandbytes>=0.39.0 Link: [FP4 Quantization](https://huggingface.co/docs/transformers/main_classes/quantization#fp4-quantization) ### Motivation Running really large language models on smaller...
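The memory savings that motivate 4-bit loading are easy to estimate: weights stored at 4 bits occupy a quarter of their fp16 footprint. A back-of-the-envelope sketch (the 40B parameter count and helper name are illustrative, not taken from the issue):

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9

# A hypothetical 40B-parameter model:
fp16_gb = weight_memory_gb(40e9, 16)  # 80.0 GB in fp16
fp4_gb = weight_memory_gb(40e9, 4)    # 20.0 GB in 4-bit
print(fp16_gb, fp4_gb)
```

In transformers, the linked FP4 path is enabled by passing `load_in_4bit=True` (or a `BitsAndBytesConfig` with `load_in_4bit=True`) to `from_pretrained`, which requires `bitsandbytes>=0.39.0`.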
### System Info I'm on an Ubuntu server from https://console.paperspace.com/ with this GPU: | NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id...
### Feature request GPTQ is not yet supported in the current version. Is there any timeline for adding it? ### Motivation quantization ### Your contribution I can help with code if needed
### Feature request Hi, I was able to deploy the base Falcon-40B model to SageMaker using the TGI DLC by following [this blog post](https://www.philschmid.de/sagemaker-falcon-llm) I also recently fine-tuned the Falcon-40B...
What do you think @Narsil? Maybe we can hide it behind a cargo feature, but then it's a bit of a mess in the docker container. We will need to build multiple...
Currently, I am running Falcon quantized on 4x NVIDIA T4 GPUs, all on the same system. I am seeing a `time_per_token` of around 190 ms during inference. Below is...
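A per-token latency like the 190 ms above can be translated into throughput and end-to-end response time to judge whether the setup is behaving reasonably. A purely arithmetic sketch (not a benchmark; the 100-token response length is an illustrative assumption):

```python
def throughput_tokens_per_s(time_per_token_ms: float) -> float:
    """Sequential decode throughput implied by a per-token latency."""
    return 1000.0 / time_per_token_ms

def response_latency_s(time_per_token_ms: float, num_tokens: int) -> float:
    """Wall-clock time to decode `num_tokens` tokens one at a time."""
    return time_per_token_ms * num_tokens / 1000.0

print(round(throughput_tokens_per_s(190), 2))  # ~5.26 tokens/s
print(response_latency_s(190, 100))            # 19.0 s for a 100-token reply
```

Note that `time_per_token` measures sequential decoding, so sharding across the four T4s mainly reduces per-GPU memory pressure rather than multiplying this number by four.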
### System Info 2023-06-15T16:56:34.095240Z INFO text_generation_launcher: Runtime environment: Target: x86_64-unknown-linux-gnu Cargo version: 1.69.0 Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182 Docker label: sha-e7248fe nvidia-smi: Thu Jun 15 16:56:34 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 515.105.01 Driver...
### System Info ``` 2023-06-15T04:27:53.010592Z INFO text_generation_launcher: Runtime environment: Target: x86_64-unknown-linux-gnu Cargo version: 1.69.0 Commit sha: 5ce89059f8149eaf313c63e9ded4199670cd74bb Docker label: sha-5ce8905 nvidia-smi: Thu Jun 15 04:27:51 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI...
# What does this PR do? Fixes https://github.com/huggingface/text-generation-inference/issues/420 ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the...
### Feature request When users save their model with `Trainer.save_model()` or `Trainer.push_to_hub()`, the training arguments are saved as a pickle file called `training_arguments.bin`. This causes a problem during the `safetensors`...
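The underlying concern is that a pickle file cannot be audited the way `safetensors` or JSON can. The same information could be persisted as JSON instead; a stdlib-only sketch using a stand-in dataclass (the field names are illustrative, not the real `TrainingArguments` schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ToyTrainingArguments:
    # Stand-in for transformers.TrainingArguments; fields are illustrative.
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    output_dir: str = "out"

def save_args_as_json(args: ToyTrainingArguments, path: str) -> None:
    """Serialize training arguments to human-readable JSON instead of pickle."""
    with open(path, "w") as f:
        json.dump(asdict(args), f, indent=2)

save_args_as_json(ToyTrainingArguments(), "training_arguments.json")
with open("training_arguments.json") as f:
    print(json.load(f)["num_train_epochs"])
```

The real `TrainingArguments` class already exposes `to_dict()` and `to_json_string()`, so a JSON export along these lines would not require custom serialization logic.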