
Large Language Model Text Generation Inference

Results 639 text-generation-inference issues

### System Info The issue occurred with `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5` on a machine with 32GB RAM and a single RTX 6000 Ada (48GB), where the shard loading aborts, but loading with raw...

# What does this PR do? Make num_shards mirror the available GPUs when CUDA_VISIBLE_DEVICES is set to "all". Setting CUDA_VISIBLE_DEVICES=all in a podman-based (CDI) setup effectively fails to use the GPU...
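The idea behind this PR can be sketched with a hypothetical helper (`resolve_num_shards` is an illustration, not the launcher's actual code): when `CUDA_VISIBLE_DEVICES` is unset or `"all"`, fall back to the number of GPUs detected on the host; otherwise count the comma-separated device ids.

```python
import os


def resolve_num_shards(detected_gpus: int, env=os.environ) -> int:
    """Hypothetical helper mirroring the PR's idea.

    If CUDA_VISIBLE_DEVICES is unset or "all", shard across every GPU
    detected on the host; otherwise shard across the listed devices.
    """
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None or visible.strip().lower() == "all":
        return detected_gpus
    return len([d for d in visible.split(",") if d.strip()])
```

For example, `resolve_num_shards(4, {"CUDA_VISIBLE_DEVICES": "all"})` yields 4, while `"0,1"` yields 2.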

### Feature request Please help me implement the speedup enabled by the TransformerEngine of the Hopper H100 GPUs https://github.com/NVIDIA/TransformerEngine https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html ### Motivation Inference speedup ### Your contribution I am...

### Feature request Add 4-bit quantization support when bitsandbytes releases it. ### Motivation Run larger models easily and performantly ### Your contribution I could make a PR if this is a...
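To illustrate what 4-bit quantization buys, here is a toy symmetric absmax quantizer (an illustration only, not the bitsandbytes kernels): each weight is scaled by the tensor's absolute maximum and rounded into the signed 4-bit range, halving memory relative to 8-bit quantization at the cost of coarser values.

```python
def quantize_4bit(values):
    # Toy symmetric absmax quantization into a signed 4-bit range [-7, 7].
    # Illustration only -- bitsandbytes uses optimized CUDA kernels and
    # block-wise scaling, not this naive per-tensor scheme.
    absmax = max(abs(v) for v in values) or 1.0
    return [round(v / absmax * 7) for v in values], absmax


def dequantize_4bit(quantized, absmax):
    # Invert the scaling to recover approximate float values.
    return [q / 7 * absmax for q in quantized]
```

A round trip loses precision but stays close for values near the absmax, which is why larger models can still run "performantly" at 4 bits.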

### System Info
```
docker exec -it text-generation-inference text-generation-launcher --env
```
```
(base) ➜ huggingface-text-generation-inference docker exec -it 401ba897d58aa498e6fffa0e717144c47fea4cf56c0578fbb4b384b42bcf6040 text-generation-launcher --env
2023-06-03T03:36:08.324157Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version:...
```

# What does this PR do? This PR enables a BLOOM model trained with DeepSpeed Chat to be parallelized. DeepSpeed Chat saves checkpoints with keys like "transformer.word_embedding.weight", so I got an error in...
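The fix amounts to renaming checkpoint keys before loading. A minimal sketch, assuming the loader expects the Hugging Face BLOOM naming (e.g. `transformer.word_embeddings.weight`) while the DeepSpeed Chat checkpoint uses `transformer.word_embedding.weight`; `remap_checkpoint_keys` is a hypothetical helper, not the PR's actual code:

```python
def remap_checkpoint_keys(state_dict: dict, renames: dict) -> dict:
    """Hypothetical helper: rewrite checkpoint keys whose prefix differs
    from the naming the model loader expects."""
    out = {}
    for key, value in state_dict.items():
        for old, new in renames.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out
```

Applied with `{"transformer.word_embedding.": "transformer.word_embeddings."}`, the mismatched embedding key is renamed and all other keys pass through untouched.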

### System Info CentOS 7 and Docker 23.0.5; 8 T4 GPUs, driver 515.65.1. ### Information - [X] Docker - [ ] The CLI directly ### Tasks - [X] An...

# What does this PR do? Simple tweak to skip initialization of the torch process group if one is already initialized. ## Before submitting - [ ] This PR fixes...
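The tweak boils down to making the initialization idempotent: `torch.distributed` does expose `is_initialized()` and `init_process_group()`, and calling the latter twice raises an error. The sketch below uses a small stand-in class in place of `torch.distributed` so it runs without a GPU; the guard logic is the point.

```python
class FakeDist:
    """Stand-in for torch.distributed (which really does provide
    is_initialized() and init_process_group()), so this sketch runs
    anywhere."""

    def __init__(self):
        self._initialized = False

    def is_initialized(self):
        return self._initialized

    def init_process_group(self, backend="nccl"):
        if self._initialized:
            raise RuntimeError("trying to initialize the default process group twice")
        self._initialized = True


def init_process_group_once(dist, backend="nccl"):
    # The PR's tweak: only initialize when no process group exists yet,
    # so embedding the server in a process that already set one up works.
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
```

Calling `init_process_group_once` a second time is a no-op instead of an error.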

### System Info Deploying the server as a Docker image on a machine without a GPU. Invoking the generation endpoint produces the error: {"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}...