gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
When `--compile` and `--compile_prefill` are enabled simultaneously, an error occurs: `RuntimeError: CUDA error: device-side assert triggered`. This issue was resolved by cloning the `next_token` produced during the prefill...
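A toy sketch of why that kind of clone helps (not the gpt-fast code; all names here are hypothetical): a compiled/graph-captured step can return a view into a reused output buffer, so keeping the raw return value lets a later compiled call clobber it, while copying it first preserves the value.

```python
# Toy illustration: a "compiled" prefill step that writes its result into
# the same preallocated buffer on every call, mimicking CUDA-graph replay.
class CompiledStep:
    def __init__(self):
        self._buf = [0]  # shared output storage, reused across calls

    def __call__(self, token: int) -> list:
        self._buf[0] = token + 1  # "next token" written in place
        return self._buf          # caller receives a view, not a copy

prefill = CompiledStep()

# Buggy pattern: keep the returned view; the next call clobbers it.
next_token_view = prefill(10)
prefill(99)                        # decode step reuses the buffer
assert next_token_view[0] != 11    # the value we wanted is gone

# Fixed pattern: copy the result before the next compiled call,
# analogous to cloning `next_token` right after prefill.
next_token = list(prefill(10))     # defensive copy
prefill(99)
assert next_token[0] == 11         # survives later buffer reuse
```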
I'm new to speculative decoding. While reading the speculative_decode code (https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L88), I had a few questions; could you please help answer them? 1. When obtaining target_logits, the input_pos...
According to Mistral's paper, the block size for Mistral-7B should be 8192 (ref: https://arxiv.org/pdf/2310.06825.pdf, https://huggingface.co/docs/transformers/en/model_doc/mistral), but it is currently set to the default value (2048).
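The fix amounts to a one-field config override. A minimal sketch, assuming a `ModelArgs`-style dataclass like the one gpt-fast uses (field names here are illustrative, not copied from the repo):

```python
from dataclasses import dataclass

# Hypothetical config sketch mirroring gpt-fast's ModelArgs style.
@dataclass
class ModelArgs:
    block_size: int = 2048   # current default, too small for Mistral-7B
    n_head: int = 32
    dim: int = 4096

# Per Mistral's paper/docs, Mistral-7B uses an 8192-token context window,
# so its registered config should override the default.
mistral_7b = ModelArgs(block_size=8192)
assert mistral_7b.block_size == 8192
```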
Repro command:
```
python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
```
Errors:
```
(pt) [[email protected] ~/local/gpt-fast (main)]$ python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
/home/ybliang/local/miniconda3/envs/pt/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node...
```
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #148 Summary: trying to fix the issue with kv_cache update by changing tracing into a tensor subclass. However, it seems we have...
I really love this project and the accompanying blogpost, so thanks! I've reimplemented some of the inference techniques to speed up an implementation of Whisper that I am using. I...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #147 * #142 Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
requirements.txt seems a bit out of date. This also adds the .venv and .vscode directories to .gitignore.
I see that the linear layers' weights are replaced with quantized weights. However, I don't see what happens to the bias in the linear layers? Is it not needed anymore?...
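In weight-only quantization schemes of this kind, the bias is typically left in floating point and simply added after the (dequantized) matmul. A minimal sketch of that pattern, in plain Python (an assumption about the general technique, not the repo's implementation):

```python
# Weight-only int8 quantization sketch: weights become int8 values plus
# per-output-channel scales, while the bias stays in floating point.
def quantize_rows(w):
    """Per-row symmetric int8 quantization of a weight matrix."""
    q, scales = [], []
    for row in w:
        s = max(abs(v) for v in row) / 127 or 1.0
        scales.append(s)
        q.append([round(v / s) for v in row])
    return q, scales

def int8_linear(x, q, scales, bias):
    """y = x @ dequant(q).T + bias; bias is untouched by quantization."""
    out = []
    for row, s, b in zip(q, scales, bias):
        acc = sum(xi * (qi * s) for xi, qi in zip(x, row))
        out.append(acc + b)
    return out

w = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
q, scales = quantize_rows(w)
y = int8_linear([1.0, 1.0], q, scales, bias)
# y is close to [0.5 - 1.0 + 0.1, 2.0 + 0.25 - 0.2] up to quantization error
assert abs(y[0] - (-0.4)) < 0.05 and abs(y[1] - 2.05) < 0.05
```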
Summary: Only works for fp32 and fp16 types, so it isn't providing much value right now. `convert_hf_checkpoint.py` can already directly generate an equivalent .pth checkpoint file without gguf...