gpt-fast

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

132 gpt-fast issues, sorted by most recently updated

I was comparing the rotary embedding implementation in this repository with the implementations in the official Llama and DeepSeek repositories, using this Jupyter notebook: [link](https://colab.research.google.com/drive/1I9aBN55UUgmUwSNTmELC1u7DWuEk1dU2?usp=sharing). In the Llama and DeepSeek repositories,...
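For context on where such discrepancies usually come from, here is a minimal, self-contained sketch (my own illustration, not code from this repository or from the Llama/DeepSeek repositories; the function names are made up for the example) of the two common RoPE layouts. The "interleaved" variant rotates adjacent dimension pairs, while the "half-split" (rotate_half) variant pairs dimension i with i + d/2; they agree only up to a permutation of the head dimensions, so a checkpoint trained under one convention must either be run under that same convention or have its q/k projection weights permuted.

```
import torch

def rope_interleaved(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # Rotate adjacent pairs (x0, x1), (x2, x3), ... -- the layout gpt-fast and
    # the original Meta Llama code use. x: (..., seq_len, head_dim).
    d = x.shape[-1]
    pos = torch.arange(x.shape[-2], dtype=torch.float32)
    freqs = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = torch.outer(pos, freqs)                    # (seq_len, d/2)
    cos, sin = ang.cos(), ang.sin()
    xp = x.float().reshape(*x.shape[:-1], -1, 2)     # (..., seq_len, d/2, 2)
    out = torch.stack(
        [xp[..., 0] * cos - xp[..., 1] * sin,
         xp[..., 0] * sin + xp[..., 1] * cos], dim=-1)
    return out.flatten(-2).type_as(x)

def rope_half_split(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # Pair dimension i with i + d/2 ("rotate_half"), as in the Hugging Face
    # Llama implementation. Same frequencies, different dimension pairing.
    d = x.shape[-1]
    pos = torch.arange(x.shape[-2], dtype=torch.float32)
    freqs = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = torch.outer(pos, freqs)                    # (seq_len, d/2)
    cos = torch.cat([ang.cos(), ang.cos()], dim=-1)  # (seq_len, d)
    sin = torch.cat([ang.sin(), ang.sin()], dim=-1)
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return (x.float() * cos + torch.cat([-x2, x1], dim=-1).float() * sin).type_as(x)
```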

```
(/home/bobren/local/a/pytorch-env) [15:08] devgpu035:/home/bobren/local/a/gpt-fast
python eval.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
Loading model ...
Time to load model: 6.96 seconds.
README.md: 100%|████████████████| 6.84k/6.84k [00:00
```

I see from the source code that this method needs the ModelArgs from model.py, but I want to use it with a DeepSeek model. How should I go about that?
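A hedged sketch of one possible route, assuming gpt-fast's ModelArgs / transformer_configs structure in model.py: for a dense, Llama-style DeepSeek checkpoint you could register a new named config and build ModelArgs from it. The hyperparameter values below are placeholders to be filled in from the model's config.json, and MoE/MLA DeepSeek variants would need architecture changes beyond what ModelArgs covers.

```
# "deepseek-llm-7b" is an illustrative key; the numbers are placeholders, not
# the real DeepSeek hyperparameters.
from model import ModelArgs, transformer_configs  # gpt-fast's model.py

transformer_configs["deepseek-llm-7b"] = dict(
    n_layer=30,               # placeholder: set from config.json
    n_head=32,                # placeholder
    dim=4096,                 # placeholder
    intermediate_size=11008,  # placeholder
    vocab_size=102400,        # placeholder
    rope_base=10000,
)

args = ModelArgs.from_name("deepseek-llm-7b")
```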

I tried the following, and it seems to break right now:
```
> python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 64
Loading model ...
Quantizing model weights for int4 weight-only affine...
```

When using int8 quantization, there is a significant performance drop in multi-batch inference compared to single-batch inference. The single-batch performance is good, but the performance doesn't scale well with increased...
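One way to narrow down whether the regression lives in the quantized linear itself is a micro-benchmark along these lines. This is a rough sketch assuming a CUDA GPU and bf16 activations; the int8 path mimics a dequantize-on-the-fly weight-only linear (F.linear on upcast weights, then per-channel scales), which is cheap when bs=1 is memory-bound but pays the upcast cost at larger batch sizes.

```
import time

import torch
import torch.nn.functional as F

def bench(fn, iters=50):
    # Warm up, then time with CUDA synchronization around the measured region.
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

dim_in, dim_out = 4096, 4096  # illustrative layer size, not tied to any model
w_bf16 = torch.randn(dim_out, dim_in, dtype=torch.bfloat16, device="cuda")
w_int8 = torch.randint(-128, 128, (dim_out, dim_in), dtype=torch.int8, device="cuda")
scales = torch.randn(dim_out, dtype=torch.bfloat16, device="cuda")

for bs in (1, 4, 16):
    x = torch.randn(bs, dim_in, dtype=torch.bfloat16, device="cuda")
    t_ref = bench(lambda: F.linear(x, w_bf16))
    # int8 weight-only path: upcast weights on the fly, apply per-channel scales
    t_q = bench(lambda: F.linear(x, w_int8.to(x.dtype)) * scales)
    print(f"bs={bs}: bf16 {t_ref * 1e6:.1f} us, int8 {t_q * 1e6:.1f} us")
```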

Summary: adds torchao APIs to gpt-fast, plus some minor tweaks.

Test Plan: (in progress)
```
export MODEL_REPO=meta-llama/Meta-Llama-3-8B
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode torchao-int8
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_torchao-int8.pth --compile
python generate.py --checkpoint_path...
```

CLA Signed

Hey, thanks for providing the gpt-fast project. I am getting an error when trying to run inference. I have fine-tuned a llama-3.1-70B model with LoRA using torchtune, and converted the checkpoint...

Has anyone run this code with batch size > 1 and speculative decoding?

Am I right that there is a mistake here, at line 191 of generate.py? For batch > 1, cur_token will have more than one element, so next_token.view(()) will raise an error.
```
if...
```
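A minimal repro of the reported failure mode, plus one possible batch-safe alternative (a sketch, not a fix from the repository):

```
import torch

next_token = torch.tensor([42])      # batch size 1
print(next_token.view(()))           # fine: reshapes to a 0-dim tensor

next_token = torch.tensor([42, 7])   # batch size 2
try:
    next_token.view(())              # can't collapse 2 elements to a scalar
except RuntimeError as e:
    print(e)                         # shape '[]' is invalid for input of size 2

# One batch-safe alternative: keep a trailing dimension per sequence instead
# of collapsing to a scalar.
print(next_token.view(-1, 1).shape)  # torch.Size([2, 1])
```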

When I use meta-llama/Llama-3.2-1B, I get the error below. Can it be fixed?
```
RuntimeError: Error(s) in loading state_dict for Transformer:
Missing key(s) in state_dict: "tok_embeddings.weight", "layers.0.attention.wqkv.weight", "layers.0.attention.wo.weight", "layers.0.feed_forward.w1.weight", "layers.0.feed_forward.w3.weight", "layers.0.feed_forward.w2.weight", "layers.0.ffn_norm.weight", "layers.0.attention_norm.weight", "layers.1.attention.wqkv.weight", "layers.1.attention.wo.weight",...
```
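Missing fused keys like layers.0.attention.wqkv.weight typically mean a raw Hugging Face checkpoint was never converted into gpt-fast's fused-QKV layout. Two things worth checking, assuming the standard gpt-fast workflow: that the conversion script was run before generate.py, and that model.py's transformer_configs has an entry matching the model name.

```
export MODEL_REPO=meta-llama/Llama-3.2-1B
python scripts/download.py --repo_id $MODEL_REPO
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/$MODEL_REPO
```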