# gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

132 gpt-fast issues

This is based on #57. Please check out https://github.com/yanboliang/gpt-fast/tree/mixtral-moe to try this. Performance numbers (tokens/second):

|                     | 1 GPU | 2 GPU | 8 GPU |
|---------------------|-------|-------|-------|
| baseline (bfloat16) | OOM   | ...   | ...   |

CLA Signed

I am running the speculative sampling task with the ‘compile’ mode of the generate.py script. The original speculative decoding version of gpt-fast decodes one prompt several times, but I want...
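For context, a minimal greedy-acceptance sketch of the propose-then-verify structure behind speculative decoding. The `target`/`draft` callables and their logits-returning interface are assumptions for illustration; gpt-fast's actual implementation in generate.py samples probabilistically and reuses KV caches.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, tokens: torch.Tensor, k: int) -> torch.Tensor:
    """One speculative decoding step, batch size 1, greedy variant.

    Assumes `target(seq)` and `draft(seq)` return logits of shape
    [batch, seq_len, vocab_size]; this is not gpt-fast's real interface.
    """
    # Draft model proposes k tokens autoregressively.
    proposal = tokens
    for _ in range(k):
        nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # Target model verifies the whole proposal in a single forward pass.
    target_next = target(proposal)[:, tokens.shape[1] - 1:].argmax(-1)  # [1, k+1]
    drafted = proposal[:, tokens.shape[1]:]                             # [1, k]
    # Accept drafted tokens up to the first disagreement, then append the
    # target's own token at that position (a "bonus" token if all k agree).
    agree = (drafted == target_next[:, :-1]).long().cumprod(dim=-1)
    n_accept = int(agree.sum())
    return torch.cat([tokens, target_next[:, : n_accept + 1]], dim=-1)
```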

**GPU: V100, CUDA: 11.8.** I have changed all the torch.bfloat16 occurrences to torch.float16 as stated in #49. Is there something I am still missing? **Error log:** root@84bb9affda66:/workspace/gpt-fast/gpt-fast-main# CUDA_VISIBLE_DEVICES=1 python generate.py --compile --checkpoint_path...
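As a small sketch of the dtype change described above (not the repo's exact code), the precision can be chosen from hardware support instead of hard-coding bfloat16, since pre-Ampere GPUs such as the V100 lack bfloat16:

```python
import torch

def pick_precision() -> torch.dtype:
    # V100 (sm_70) has no bfloat16 support, so fall back to float16 there.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(pick_precision())
```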

# Model Output Is Garbled When Using Multiple A100 GPUs (8 (or 2 or more) × A100) and the Program Fails to Terminate Properly

## Environment
- Ubuntu 20.04
- Python...

+ option to specify name of model on command line

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #67
* #66

When the installed PyTorch version is not nightly:
- Issue a warning if `torch.compile` is requested.
- Raise an...
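A minimal sketch of the kind of version check this stack describes; the actual PR's logic and wording may differ:

```python
import warnings
import torch

def warn_if_not_nightly(compile_requested: bool) -> None:
    # Nightly builds carry a "dev" tag in the version string, e.g. "2.3.0.dev20240101".
    is_nightly = "dev" in torch.__version__
    if compile_requested and not is_nightly:
        warnings.warn(
            "torch.compile was requested, but the installed PyTorch is not a "
            "nightly build; performance may suffer."
        )

warn_if_not_nightly(compile_requested=True)
```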

CLA Signed

Firstly, thanks for your wonderful work. I have a question about making gpt-fast production-ready: features such as a repetition penalty and stop strings are currently not included in...
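Regarding the repetition penalty: one common way such a penalty is implemented (this helper is illustrative and is not part of gpt-fast) is to rescale the logits of tokens that have already been generated before sampling:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated: torch.Tensor,
                             penalty: float = 1.1) -> torch.Tensor:
    """Down-weight tokens already present in `generated`.

    logits:    [batch, vocab_size] scores for the next token
    generated: [batch, seq_len] token ids produced so far
    """
    scores = logits.gather(-1, generated)
    # Divide positive logits and multiply negative ones, so previously seen
    # tokens always become less likely (the usual HF-style formulation).
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits.scatter(-1, generated, scores)

# Example: penalize token ids 3 and 7 in a toy vocabulary of size 10.
logits = torch.randn(1, 10)
print(apply_repetition_penalty(logits, torch.tensor([[3, 7]])))
```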

root@md:/home/projects/gpt-fast# CUDA_VISIBLE_DEVICES=0 python3 generate.py --compile --checkpoint_path /models/huggingface_models/meta-Llama-2-7b-hf/model_int8.pth --max_new_tokens 100 Loading model ... Using int8 weight-only quantization! /opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage...

I'm trying to replace [F.scaled_dot_product_attention](https://github.com/pytorch-labs/gpt-fast/blob/db7b273ab86b75358bd3b014f1f022a19aba4797/model.py#L182) with a [flash decoding kernel](https://pytorch.org/blog/flash-decoding/) for faster inference. However, while the flash decoding function works well in eager mode, I cannot make it work with...
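One pattern that can help when a hand-written kernel breaks under `torch.compile` is to register it as a custom op, so the compiler treats it as an opaque call instead of tracing into its internals. The sketch below assumes PyTorch ≥ 2.4 for `torch.library.custom_op`; `flash_decode` here just wraps SDPA as a stand-in for the real flash-decoding kernel:

```python
import torch
import torch.nn.functional as F

@torch.library.custom_op("mylib::flash_decode", mutates_args=())
def flash_decode(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Stand-in body; in practice this would call the external flash-decoding kernel.
    return F.scaled_dot_product_attention(q, k, v)

@flash_decode.register_fake
def _(q, k, v):
    # Shape/dtype propagation so torch.compile can plan around the opaque op
    # (assumes q, k, v share the same head dimension).
    return torch.empty_like(q)
```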

## Problem

`torch.compile()` shows an impressive ~2x speed-up for this code repo, but when applied to Hugging Face transformers there is barely any speed-up. I want to understand why, and then...
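Part of the answer is usually that gpt-fast compiles a single static-shape decoding step (pre-allocated KV cache, no graph breaks), which lets `torch.compile` apply CUDA graphs. A rough sketch of that setup; the model call signature here is illustrative rather than gpt-fast's exact code:

```python
import torch

def decode_one_token(model, x, input_pos):
    # One decoding step against a pre-allocated, static-shape KV cache;
    # constant shapes across calls are what make CUDA graphs applicable.
    logits = model(x, input_pos)
    return torch.argmax(logits[:, -1], dim=-1, keepdim=True)

# mode="reduce-overhead" enables CUDA graphs; fullgraph=True forbids graph breaks.
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
```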