gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
When `--compile` and `--compile_prefill` are enabled simultaneously, an error occurs: `RuntimeError: CUDA error: device-side assert triggered`. This issue was resolved by cloning the `next_token` produced during the prefill...
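A toy sketch of why that kind of clone helps (not the gpt-fast code; all names here are hypothetical): a compiled/graph-captured step can return a view into a reused output buffer, so keeping the raw return value lets a later compiled call clobber it, while copying it first preserves the value.

```python
# Toy illustration: a "compiled" prefill step that writes its result into
# the same preallocated buffer on every call, mimicking CUDA-graph replay.
class CompiledStep:
    def __init__(self):
        self._buf = [0]  # shared output storage, reused across calls

    def __call__(self, token: int) -> list:
        self._buf[0] = token + 1  # "next token" written in place
        return self._buf          # caller receives a view, not a copy

prefill = CompiledStep()

# Buggy pattern: keep the returned view; the next call clobbers it.
next_token_view = prefill(10)
prefill(99)                        # decode step reuses the buffer
assert next_token_view[0] != 11    # the value we wanted is gone

# Fixed pattern: copy the result before the next compiled call,
# analogous to cloning `next_token` right after prefill.
next_token = list(prefill(10))     # defensive copy
prefill(99)
assert next_token[0] == 11         # survives later buffer reuse
```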
I'm new to speculative decoding. While reading the speculative_decode code (https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L88), I had a few questions; could you please help answer them? 1. When obtaining target_logits, the input_pos...
According to Mistral's paper, the block size for Mistral-7B should be 8192 (ref: https://arxiv.org/pdf/2310.06825.pdf, https://huggingface.co/docs/transformers/en/model_doc/mistral), but it is currently set to the default value (2048).
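The fix amounts to a one-field config override. A minimal sketch, assuming a `ModelArgs`-style dataclass like the one gpt-fast uses (field names here are illustrative, not copied from the repo):

```python
from dataclasses import dataclass

# Hypothetical config sketch mirroring gpt-fast's ModelArgs style.
@dataclass
class ModelArgs:
    block_size: int = 2048   # current default, too small for Mistral-7B
    n_head: int = 32
    dim: int = 4096

# Per Mistral's paper/docs, Mistral-7B uses an 8192-token context window,
# so its registered config should override the default.
mistral_7b = ModelArgs(block_size=8192)
assert mistral_7b.block_size == 8192
```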
Repro command:
```
python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
```
Errors:
```
(pt) [[email protected] ~/local/gpt-fast (main)]$ python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
/home/ybliang/local/miniconda3/envs/pt/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node...
```
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #148 Summary: trying to fix the issue with kv_cache update by changing tracing into a tensor subclass. However, it seems we have...
I really love this project and the accompanying blogpost, so thanks! I've reimplemented some of the inference techniques to speed up an implementation of Whisper that I am using. I...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #147 * #142 Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
requirements.txt seems a bit out of date. This also adds the .venv and .vscode directories to .gitignore.
I see that the linear layers' weights are replaced with quantized weights. However, I don't see what happens to the bias in the linear layers? Is it not needed anymore?...
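In weight-only quantization schemes of this kind, the bias is typically left in floating point and simply added after the (dequantized) matmul. A minimal sketch of that pattern, in plain Python (an assumption about the general technique, not the repo's implementation):

```python
# Weight-only int8 quantization sketch: weights become int8 values plus
# per-output-channel scales, while the bias stays in floating point.
def quantize_rows(w):
    """Per-row symmetric int8 quantization of a weight matrix."""
    q, scales = [], []
    for row in w:
        s = max(abs(v) for v in row) / 127 or 1.0
        scales.append(s)
        q.append([round(v / s) for v in row])
    return q, scales

def int8_linear(x, q, scales, bias):
    """y = x @ dequant(q).T + bias; bias is untouched by quantization."""
    out = []
    for row, s, b in zip(q, scales, bias):
        acc = sum(xi * (qi * s) for xi, qi in zip(x, row))
        out.append(acc + b)
    return out

w = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
q, scales = quantize_rows(w)
y = int8_linear([1.0, 1.0], q, scales, bias)
# y is close to [0.5 - 1.0 + 0.1, 2.0 + 0.25 - 0.2] up to quantization error
assert abs(y[0] - (-0.4)) < 0.05 and abs(y[1] - 2.05) < 0.05
```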
Summary: Only works for fp32 and fp16 types, so it isn't providing much value right now. `convert_hf_checkpoint.py` can already directly generate an equivalent .pth checkpoint file without gguf...