# gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

132 gpt-fast issues

This is based on #57. Please check out https://github.com/yanboliang/gpt-fast/tree/mixtral-moe to try this. Performance numbers (tokens/second):

|                     | 1 GPU | 2 GPU | 8 GPU |
|---------------------|-------|-------|-------|
| baseline (bfloat16) | OOM   | ...   | ...   |

CLA Signed

I am running the speculative sampling task with the ‘compile’ mode of the generate.py script. The original speculative decoding version of gpt-fast decodes one prompt several times, but I want...
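For context, a minimal greedy-acceptance sketch of the propose-then-verify structure behind speculative decoding. The `target`/`draft` callables and their logits-returning interface are assumptions for illustration; gpt-fast's actual implementation in generate.py samples probabilistically and reuses KV caches.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens: torch.Tensor, k: int) -> torch.Tensor:
    """One speculative decoding step, batch size 1, greedy variant.

    Assumes `target(seq)` and `draft(seq)` return logits of shape
    [batch, seq_len, vocab_size]; this is not gpt-fast's real interface.
    """
    # Draft model proposes k tokens autoregressively.
    proposal = tokens
    for _ in range(k):
        nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # Target model verifies the whole proposal in a single forward pass.
    target_next = target(proposal)[:, tokens.shape[1] - 1:].argmax(-1)  # [1, k+1]
    drafted = proposal[:, tokens.shape[1]:]                             # [1, k]
    # Accept drafted tokens up to the first disagreement, then append the
    # target's own token at that position (a "bonus" token if all k agree).
    agree = (drafted == target_next[:, :-1]).long().cumprod(dim=-1)
    n_accept = int(agree.sum())
    return torch.cat([tokens, target_next[:, : n_accept + 1]], dim=-1)
```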

**GPU: V100, CUDA: 11.8.** I have changed all the torch.bfloat16 occurrences to torch.float16 as stated in #49. Is there something I am still missing? **Error log:** root@84bb9affda66:/workspace/gpt-fast/gpt-fast-main# CUDA_VISIBLE_DEVICES=1 python generate.py --compile --checkpoint_path...
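As a small sketch of the dtype change described above (not the repo's exact code), the precision can be chosen from hardware support instead of hard-coding bfloat16, since pre-Ampere GPUs such as the V100 lack bfloat16:

```python
import torch

def pick_precision() -> torch.dtype:
    # V100 (sm_70) has no bfloat16 support, so fall back to float16 there.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(pick_precision())
```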

# Model Output Is Garbled When Using Multiple A100 GPUs (8 (or 2 or more) × A100) and the Program Fails to Terminate Properly

## Environment
- Ubuntu 20.04
- Python...

+ option to specify name of model on command line

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #67
* #66

When the installed PyTorch version is not nightly:
- Issue a warning if `torch.compile` is requested.
- Raise an...
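A minimal sketch of the kind of version check this stack describes; the actual PR's logic and wording may differ:

```python
import warnings
import torch

def warn_if_not_nightly(compile_requested: bool) -> None:
    # Nightly builds carry a "dev" tag in the version string, e.g. "2.3.0.dev20240101".
    is_nightly = "dev" in torch.__version__
    if compile_requested and not is_nightly:
        warnings.warn(
            "torch.compile was requested, but the installed PyTorch is not a "
            "nightly build; performance may suffer."
        )

warn_if_not_nightly(compile_requested=True)
```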

CLA Signed

Firstly, thanks for your wonderful work. I have a question about making gpt-fast production-ready: features such as a repetition penalty and stop strings are currently not included in...
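Regarding the repetition penalty: one common way such a penalty is implemented (this helper is illustrative and is not part of gpt-fast) is to rescale the logits of tokens that have already been generated before sampling:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated: torch.Tensor,
                             penalty: float = 1.1) -> torch.Tensor:
    """Down-weight tokens already present in `generated`.

    logits:    [batch, vocab_size] scores for the next token
    generated: [batch, seq_len] token ids produced so far
    """
    scores = logits.gather(-1, generated)
    # Divide positive logits and multiply negative ones, so previously seen
    # tokens always become less likely (the usual HF-style formulation).
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits.scatter(-1, generated, scores)

# Example: penalize token ids 3 and 7 in a toy vocabulary of size 10.
logits = torch.randn(1, 10)
print(apply_repetition_penalty(logits, torch.tensor([[3, 7]])))
```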

root@md:/home/projects/gpt-fast# CUDA_VISIBLE_DEVICES=0 python3 generate.py --compile --checkpoint_path /models/huggingface_models/meta-Llama-2-7b-hf/model_int8.pth --max_new_tokens 100 Loading model ... Using int8 weight-only quantization! /opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage...

I'm trying to replace [F.scaled_dot_product_attention](https://github.com/pytorch-labs/gpt-fast/blob/db7b273ab86b75358bd3b014f1f022a19aba4797/model.py#L182) with a [flash decoding kernel](https://pytorch.org/blog/flash-decoding/) for faster inference. However, while the flash decoding function works well in eager mode, I cannot make it work with...
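One pattern that can help when a hand-written kernel breaks under `torch.compile` is to register it as a custom op, so the compiler treats it as an opaque call instead of tracing into its internals. The sketch below assumes PyTorch ≥ 2.4 for `torch.library.custom_op`; `flash_decode` here just wraps SDPA as a stand-in for the real flash-decoding kernel:

```python
import torch
import torch.nn.functional as F

@torch.library.custom_op("mylib::flash_decode", mutates_args=())
def flash_decode(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Stand-in body; in practice this would call the external flash-decoding kernel.
    return F.scaled_dot_product_attention(q, k, v)

@flash_decode.register_fake
def _(q, k, v):
    # Shape/dtype propagation so torch.compile can plan around the opaque op
    # (assumes q, k, v share the same head dimension).
    return torch.empty_like(q)
```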

## Problem

`torch.compile()` shows an impressive ~2x speed-up for this code repo, but when applied to Hugging Face transformers there is barely any speed-up. I want to understand why, and then...
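Part of the answer is usually that gpt-fast compiles a single static-shape decoding step (pre-allocated KV cache, no graph breaks), which lets `torch.compile` apply CUDA graphs. A rough sketch of that setup; the model call signature here is illustrative rather than gpt-fast's exact code:

```python
import torch

def decode_one_token(model, x, input_pos):
    # One decoding step against a pre-allocated, static-shape KV cache;
    # constant shapes across calls are what make CUDA graphs applicable.
    logits = model(x, input_pos)
    return torch.argmax(logits[:, -1], dim=-1, keepdim=True)

# mode="reduce-overhead" enables CUDA graphs; fullgraph=True forbids graph breaks.
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
```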