gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
I loved seeing the blog post with a simple, standalone implementation of many techniques used in production to speed up LLMs. Would love to see this extended to MoE like...
Extend the existing device variable to support code gen for other targets. New args to generate: `--use_sdpa` # by default, SDPA is disabled for best performance on CUDA. CPU presents...
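A `use_sdpa` toggle would presumably switch between the fused `scaled_dot_product_attention` kernel and an eager fallback. A minimal sketch of what such a switch could look like; the `attention` helper and its signature are hypothetical, and only `F.scaled_dot_product_attention` is a real PyTorch API:

```python
# Hypothetical sketch of a use_sdpa toggle; only
# F.scaled_dot_product_attention is a real PyTorch API here.
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None, use_sdpa: bool = True):
    if use_sdpa:
        # Fused SDPA kernel (flash / mem-efficient backends where available).
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    # Eager fallback for targets where SDPA is not the fastest choice.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```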
This PR fixes the checkpoint conversion scripts for the `.safetensors` checkpoints on https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1. This is necessary because (1) safetensors are loaded differently from `.pt` files, and (2) the `.pt` files and...
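Since the two formats load differently, a conversion script needs a branch on the file type. A rough sketch under the assumption that conversion starts from a flat state dict; `load_checkpoint` is a hypothetical helper, while `safetensors.torch.load_file` and `torch.load` are the real loading APIs:

```python
from pathlib import Path
import torch
from safetensors.torch import load_file  # real safetensors API

def load_checkpoint(checkpoint_path: Path) -> dict:  # hypothetical helper
    if checkpoint_path.suffix == ".safetensors":
        # safetensors files are unpickled, flat tensor dicts.
        return load_file(str(checkpoint_path))
    # .pt/.bin files are pickled; mmap (PyTorch >= 2.1) avoids a full RAM copy.
    return torch.load(checkpoint_path, map_location="cpu", mmap=True)
```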
I'm using torch.__version__ = 2.1.0a0+32f93b1, which doesn't have this op: AttributeError: '_OpNamespace' 'aten' object has no attribute '_convert_weight_to_int4pack'. What exactly does this op do, and is it defined elsewhere? Unfortunately, upgrading to the...
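For what it's worth, the error just means that build of PyTorch predates the op. One way to check for it at runtime, assuming failing with a RuntimeError is acceptable:

```python
import torch

# The op lives in the aten namespace; older builds simply don't define it.
if not hasattr(torch.ops.aten, "_convert_weight_to_int4pack"):
    raise RuntimeError(
        f"torch {torch.__version__} lacks aten._convert_weight_to_int4pack; "
        "int4 weight-only quantization requires a newer build."
    )
```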
Added a `scaling_factor` to the rotary embedding calculation. This is for use with models like [DeepSeek](https://github.com/deepseek-ai/), which uses `LlamaLinearScalingRotaryEmbedding`. The only difference is that the freqs in `precompute_freqs_cis` are divided...
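A sketch of how linear scaling might slot into a gpt-fast-style `precompute_freqs_cis`; dividing the position indices `t` by `scaling_factor` is equivalent to dividing the resulting freqs, following `LlamaLinearScalingRotaryEmbedding`. The exact placement of the division is the PR's, not confirmed here:

```python
import torch

def precompute_freqs_cis(seq_len: int, n_elem: int, base: int = 10000,
                         scaling_factor: float = 1.0) -> torch.Tensor:
    freqs = 1.0 / (base ** (torch.arange(0, n_elem, 2).float() / n_elem))
    # Linear RoPE scaling: stretch positions by 1/scaling_factor.
    t = torch.arange(seq_len, dtype=torch.float32) / scaling_factor
    freqs = torch.outer(t, freqs)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
    return torch.stack([freqs_cis.real, freqs_cis.imag], dim=-1)
```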
`torch.compile` always re-compiles a function from scratch in a new Python session, which takes a lot of time. I'm wondering if there's a way to cache the compilation result in...
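Depending on the PyTorch version, Inductor can persist compiled artifacts across processes. A sketch of two knobs present in recent releases; whether `fx_graph_cache` is available varies by version, so treat it as an assumption to verify:

```python
import os
# Where Inductor writes its on-disk artifacts (real env var).
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/tmp/inductor_cache"

import torch
import torch._inductor.config as inductor_config

# Reuse compiled FX graphs across Python sessions (recent releases only).
inductor_config.fx_graph_cache = True

@torch.compile
def decode_step(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x)
```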
**Bug Report** **Description:** I encountered a bug when attempting to convert a model from Hugging Face (HF) using the provided code implementation. The issue appears to be related to counting...