gpt-fast
                                
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Hi, I just did a quick implementation of gpt-fast and ran inference on Llama-2 7B. I seem to get around 65 tokens per second on average without quantization....
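For context on a figure like "tokens per second on average", here is a minimal sketch of one way to measure it. This is not gpt-fast's own benchmarking code: `tokens_per_second` and `fake_generate` are hypothetical names, and the stand-in generator only sleeps to simulate decoding. The sketch assumes a warm-up run is discarded, since a compiled model's first call typically includes compilation time.

```python
import time
from typing import Callable, List


def tokens_per_second(
    generate: Callable[[int], List[int]],
    max_new_tokens: int,
    runs: int = 3,
    warmup: int = 1,
) -> float:
    """Return the mean decoding throughput (tokens/s) over `runs` timed calls."""
    # Untimed warm-up calls, so one-off costs (e.g. compilation) are excluded.
    for _ in range(warmup):
        generate(max_new_tokens)

    rates = []
    for _ in range(runs):
        # On a GPU, synchronize the device before reading the clock
        # (for PyTorch: torch.cuda.synchronize()), here and after generate().
        start = time.perf_counter()
        tokens = generate(max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)


def fake_generate(n: int) -> List[int]:
    """Hypothetical stand-in for a model: takes about 1 ms per token."""
    out = []
    for i in range(n):
        time.sleep(0.001)
        out.append(i)
    return out


if __name__ == "__main__":
    rate = tokens_per_second(fake_generate, max_new_tokens=50)
    print(f"{rate:.1f} tokens/s")
```

Averaging per-run rates, as done here, weights every run equally; dividing total tokens by total time is an equally reasonable choice and gives slightly different numbers when run times vary.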
so that everyone can use it out of the box?
Amazing work! How were the benchmark results obtained? Is it just the generation speed measured with the GPU power-limited to 330W?
The existing code defines a variable `precision` ([here](https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/generate.py#L266)) which is then used in `_load_model()` [here](https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/generate.py#L230) to set the dtype for the model. However, this variable was not getting passed to...
Load the checkpoint directly to the device. In my testing, loading Llama 7B went from 7.83 to 5.78 seconds (about 25% faster). I also noticed that memory usage doubles temporarily, at least...
I want to use interactive mode with this command: `python generate.py --compile --interactive --draft_checkpoint_path checkpoints/$DRAFT_MODEL_REPO/model_int8.pth --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --speculate_k 3`. However, I got the following error, could you please...
Great blogpost! Is there any documentation on how `inductor` lowers the `ops` in the `fx graph` to actual kernels -- specifically the optimization / tuning that determines the actual kernel...
Running the 13B chat model on an L4 GPU with
```
python generate.py --checkpoint_path .../model_int4.g32.pth --compile --compile_prefill
```
An error happens:
```
Traceback (most recent call last):
  File "/home/user/gpt-fast/generate.py", line 407, in...
```
Performance loss for int4 compared with AWQ?