
Large Language Model Text Generation Inference

Results 639 text-generation-inference issues

### System Info The issue occurred with `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5` on a machine with 32GB RAM and a single RTX 6000 Ada (48GB), where the shard loading aborts, but loading with raw...

# What does this PR do? Make num_shards mirror the available GPUs when CUDA_VISIBLE_DEVICES is set to "all". Setting CUDA_VISIBLE_DEVICES=all in a podman-based (CDI) setup effectively fails to use the GPU...
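The idea behind this PR can be sketched with a hypothetical helper (`resolve_num_shards` is an illustration, not the launcher's actual code): when `CUDA_VISIBLE_DEVICES` is unset or `"all"`, fall back to the number of GPUs detected on the host; otherwise count the comma-separated device ids.

```python
import os


def resolve_num_shards(detected_gpus: int, env=os.environ) -> int:
    """Hypothetical helper mirroring the PR's idea.

    If CUDA_VISIBLE_DEVICES is unset or "all", shard across every GPU
    detected on the host; otherwise shard across the listed devices.
    """
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None or visible.strip().lower() == "all":
        return detected_gpus
    return len([d for d in visible.split(",") if d.strip()])
```

For example, `resolve_num_shards(4, {"CUDA_VISIBLE_DEVICES": "all"})` yields 4, while `"0,1"` yields 2.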

### Feature request Please help me implement the speedup enabled by the TransformerEngine of the Hopper H100 GPUs https://github.com/NVIDIA/TransformerEngine https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html ### Motivation Inference speedup ### Your contribution I am...

### Feature request Add 4-bit quantization support when bitsandbytes releases it. ### Motivation Run larger models easily and performantly ### Your contribution I could make a PR if this is a...
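To illustrate what 4-bit quantization buys, here is a toy symmetric absmax quantizer (an illustration only, not the bitsandbytes kernels): each weight is scaled by the tensor's absolute maximum and rounded into the signed 4-bit range, halving memory relative to 8-bit quantization at the cost of coarser values.

```python
def quantize_4bit(values):
    # Toy symmetric absmax quantization into a signed 4-bit range [-7, 7].
    # Illustration only -- bitsandbytes uses optimized CUDA kernels and
    # block-wise scaling, not this naive per-tensor scheme.
    absmax = max(abs(v) for v in values) or 1.0
    return [round(v / absmax * 7) for v in values], absmax


def dequantize_4bit(quantized, absmax):
    # Invert the scaling to recover approximate float values.
    return [q / 7 * absmax for q in quantized]
```

A round trip loses precision but stays close for values near the absmax, which is why larger models can still run "performantly" at 4 bits.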

### System Info
```
docker exec -it text-generation-inference text-generation-launcher --env
```
```
(base) ➜ huggingface-text-generation-inference docker exec -it 401ba897d58aa498e6fffa0e717144c47fea4cf56c0578fbb4b384b42bcf6040 text-generation-launcher --env
2023-06-03T03:36:08.324157Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version:...
```

# What does this PR do? This PR enables a BLOOM model trained with DeepSpeed Chat to be parallelized. DeepSpeed Chat saves checkpoints with keys like "transformer.word_embedding.weight", so I got an error in...
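The fix amounts to renaming checkpoint keys before loading. A minimal sketch, assuming the loader expects the Hugging Face BLOOM naming (e.g. `transformer.word_embeddings.weight`) while the DeepSpeed Chat checkpoint uses `transformer.word_embedding.weight`; `remap_checkpoint_keys` is a hypothetical helper, not the PR's actual code:

```python
def remap_checkpoint_keys(state_dict: dict, renames: dict) -> dict:
    """Hypothetical helper: rewrite checkpoint keys whose prefix differs
    from the naming the model loader expects."""
    out = {}
    for key, value in state_dict.items():
        for old, new in renames.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out
```

Applied with `{"transformer.word_embedding.": "transformer.word_embeddings."}`, the mismatched embedding key is renamed and all other keys pass through untouched.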

### System Info CentOS 7 and Docker 23.0.5; 8 T4 GPUs, driver 515.65.1. ### Information - [X] Docker - [ ] The CLI directly ### Tasks - [X] An...

# What does this PR do? Simple tweak to skip initialization of the torch process group if one is already initialized. ## Before submitting - [ ] This PR fixes...
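The tweak boils down to making the initialization idempotent: `torch.distributed` does expose `is_initialized()` and `init_process_group()`, and calling the latter twice raises an error. The sketch below uses a small stand-in class in place of `torch.distributed` so it runs without a GPU; the guard logic is the point.

```python
class FakeDist:
    """Stand-in for torch.distributed (which really does provide
    is_initialized() and init_process_group()), so this sketch runs
    anywhere."""

    def __init__(self):
        self._initialized = False

    def is_initialized(self):
        return self._initialized

    def init_process_group(self, backend="nccl"):
        if self._initialized:
            raise RuntimeError("trying to initialize the default process group twice")
        self._initialized = True


def init_process_group_once(dist, backend="nccl"):
    # The PR's tweak: only initialize when no process group exists yet,
    # so embedding the server in a process that already set one up works.
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
```

Calling `init_process_group_once` a second time is a no-op instead of an error.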

### System Info Deploying the server as a Docker image on a machine without a GPU. Invoking the generation endpoint produces the error: {"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}...