
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

132 gpt-fast issues

Hi, I just did a quick implementation of gpt-fast and ran inference on Llama-2 7B. I seem to get around 65 tokens per second on average without quantization....
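
A minimal sketch of how a tokens-per-second figure like this can be measured, assuming a `generate_fn` callable that emits one token per call (a stand-in for gpt-fast's decode step, not its actual API):

```python
import time
import torch

def measure_tokens_per_sec(generate_fn, num_tokens: int = 200) -> float:
    generate_fn()  # warm-up call so compilation/caching doesn't skew timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_fn()  # one decode step -> one new token
    torch.cuda.synchronize()  # flush queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed
```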

Amazing work! How were the benchmark results obtained? Is it just the generation speed measured with the GPU power-limited to 330W?

The existing code defines a variable `precision` ([here](https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/generate.py#L266)) which is then used in `_load_model()` [here](https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/generate.py#L230) to set the dtype for the model. However, this variable was not getting passed to...

CLA Signed
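
For context, a minimal sketch of the fix being described, assuming the change is simply to thread `precision` through to model loading; the tiny `nn.Linear` stands in for the real Transformer, and this `_load_model` is a simplified illustration, not the actual function in generate.py:

```python
import torch
from torch import nn

def _load_model(checkpoint_path: str, device: str, precision: torch.dtype) -> nn.Module:
    # Simplified stand-in for generate.py's _load_model(); the key change is
    # that `precision` is now applied instead of being left unused.
    model = nn.Linear(16, 16)  # toy model in place of the real Transformer
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    return model.to(device=device, dtype=precision)

torch.save(nn.Linear(16, 16).state_dict(), "toy.pth")  # stand-in checkpoint
precision = torch.bfloat16                        # defined once in the script...
model = _load_model("toy.pth", "cpu", precision)  # ...and now passed through
print(next(model.parameters()).dtype)             # torch.bfloat16
```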

Load the checkpoint directly to the device: in my testing, loading Llama 7B went from 7.83 to 5.78 seconds (about 25% faster). I also noticed that memory usage doubles temporarily, at least...

CLA Signed
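
A rough sketch of what loading directly to device can look like, assuming the slower path deserialized to CPU first; `mmap=True` (PyTorch 2.1+) also speaks to the temporary memory doubling, since it maps the file rather than buffering a full copy in host RAM:

```python
import torch

# Assumed slow path: deserialize into CPU RAM, then copy every tensor over.
# state_dict = torch.load("checkpoints/model.pth", map_location="cpu")
# state_dict = {k: v.to("cuda") for k, v in state_dict.items()}

# Direct-to-device path: map storages onto the target device as they are read.
state_dict = torch.load(
    "checkpoints/model.pth",
    map_location="cuda",
    mmap=True,  # PyTorch 2.1+: memory-map the checkpoint file
)
```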

I want to use interactive mode and ran this command:

```
python generate.py --compile --interactive --draft_checkpoint_path checkpoints/$DRAFT_MODEL_REPO/model_int8.pth --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --speculate_k 3
```

However, I got the following error, could you please...

Great blog post! Is there any documentation on how `inductor` lowers the `ops` in the `fx graph` to actual kernels -- specifically the optimization / tuning that determines the actual kernel...
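
One practical starting point: compile a toy function and dump the generated kernels via PyTorch's standard debug switches (`TORCH_LOGS="output_code"` or `TORCH_COMPILE_DEBUG=1`); inductor's fusion and tuning decisions show up directly in the emitted Triton/C++ output. A small sketch:

```python
import torch

@torch.compile
def f(x):
    # A small pointwise chain: inductor typically fuses this into one kernel.
    return torch.nn.functional.gelu(x) * 2 + 1

# Run as:  TORCH_LOGS="output_code" python inspect_inductor.py
# (or TORCH_COMPILE_DEBUG=1 to dump full debug artifacts to disk)
x = torch.randn(1024, device="cuda" if torch.cuda.is_available() else "cpu")
f(x)
```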

Running the 13B chat model on an L4 GPU with

```
python generate.py --checkpoint_path .../model_int4.g32.pth --compile --compile_prefill
```

an error happens:

```
Traceback (most recent call last):
  File "/home/user/gpt-fast/generate.py", line 407, in...
```

Is there a performance loss for int4 compared with AWQ?