gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
I am trying to quantize llama-2-7b-chat-hf with gpt-fast using `python quantize.py --mode int4 --groupsize 32` on Kaggle (2x T4 GPUs). I have installed the PyTorch nightly using...
Everything works on my A6000s and A100s, but not on the older V100 (it says the compute capability is too low). Are there plans to add support for legacy devices? Thanks!
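For anyone hitting the same error, a minimal sketch for checking what your devices report; the 8.0 threshold below is an assumption (the int4/bf16 kernels generally expect Ampere-class hardware, while the V100 is 7.0), not a documented guarantee of gpt-fast:

```py
import torch

# Print the compute capability of each visible GPU so you can tell
# whether it meets the kernel's minimum requirement.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"cuda:{i} {name} -> compute capability {major}.{minor}")

# Hypothetical guard: assumed threshold of 8.0; the exact requirement
# may differ per kernel.
if torch.cuda.get_device_capability(0) < (8, 0):
    print("Pre-Ampere device: the int4 path will likely refuse to run; "
          "int8 or fp16 modes may be the fallback.")
```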
I'm attempting to deploy llama2-7b-chat-hf on a server equipped with two V100 GPUs linked by NVLink, but I've encountered an issue where the tokens per second (token/s) performance is worse...
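A quick diagnostic sketch (not part of gpt-fast itself) that is worth running first: confirm PyTorch actually sees peer-to-peer access between the two cards, since falling back to host copies for the tensor-parallel all-reduce would explain lower tokens/s.

```py
import torch

# Check whether each GPU pair can use direct peer-to-peer (P2P) access.
# If this prints False, the all-reduce in tensor-parallel generation is
# likely going through host memory instead of NVLink.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"P2P cuda:{src} -> cuda:{dst}: {ok}")
```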
Thanks for the amazing work! It really is super fast at bs=1. Can batched use cases, or dynamic batching, be supported?
I run CodeLlama 7B: with FP16 the achieved bandwidth is 700 GB/s, but with INT8 it is only 197 GB/s. I run the model on one AMD MI210...
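For context on what those numbers imply: gpt-fast reports "bandwidth achieved" as roughly model size in bytes times tokens/s, so a back-of-the-envelope check (parameter count and byte sizes below are approximations, not measured values) suggests the int8 path is actually producing fewer tokens/s than fp16:

```py
# Rough arithmetic, assuming bandwidth = (model bytes) * (tokens/s).
params = 6.7e9                      # ~CodeLlama-7B parameters (approximate)

fp16_bytes = params * 2             # ~13.4 GB of weights at fp16
int8_bytes = params * 1             # ~6.7 GB of weights at int8

fp16_tok_s = 700e9 / fp16_bytes     # ~52 tokens/s implied by 700 GB/s
int8_tok_s = 197e9 / int8_bytes     # ~29 tokens/s implied by 197 GB/s

print(f"fp16: ~{fp16_tok_s:.0f} tok/s, int8: ~{int8_tok_s:.0f} tok/s")
# So int8 is not just reporting a lower bandwidth number: it is emitting
# fewer tokens/s, which points at the int8 matmul kernels on MI210 being
# the bottleneck rather than memory traffic.
```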
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): __->__ #83, #104. Summary: this makes GPTQ work.
I am using AMD MI210s. After loading the models, the following steps are extremely slow (see screenshot). It turned out that the compilation time is 270 seconds. Could you please help...
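A minimal sketch for separating compile time from steady-state time, using a hypothetical stand-in module rather than the real model; the inductor cache flag at the end is version-dependent, so treat it as an assumption about your nightly:

```py
import time
import torch

# Hypothetical tiny module standing in for the real model, just to show
# how to time the first compiled call separately from warm calls.
model = torch.nn.Linear(4096, 4096).cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, device="cuda")

t0 = time.time()
compiled(x)                      # first call: triggers compilation
torch.cuda.synchronize()
print(f"first call (includes compile): {time.time() - t0:.1f}s")

t0 = time.time()
for _ in range(10):
    compiled(x)                  # later calls: compiled kernels only
torch.cuda.synchronize()
print(f"10 warm calls: {time.time() - t0:.3f}s")

# Optional and version-dependent: cache inductor artifacts across runs.
# torch._inductor.config.fx_graph_cache = True
```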
Hi, thanks for building this wonderful open-source project! I am using GPTQ to first quantize a llama2-7b-chat-hf model:
```bash
python quantize.py --checkpoint_path checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --mode int4-gptq --calibration_tasks wikitext --calibration_seq_length 2048
```
...
Given a model compiled with:
```py
model = torch.compile(model, mode="reduce-overhead", fullgraph=True, dynamic=True)
```
where the bulk of the task is computing next-token logits for different prompts (MMLU), memory usage...
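One workaround sometimes suggested for this, sketched under the assumption that the memory growth comes from new compiled/captured variants per prompt length (especially with reduce-overhead's CUDA graphs): pad prompts to a small set of fixed bucket lengths so the compiler only ever sees a handful of shapes. The helper and bucket sizes below are hypothetical.

```py
import torch
import torch.nn.functional as F

# Hypothetical helper: pad token ids up to the nearest bucket length so
# torch.compile only sees a few distinct input shapes instead of a new
# shape (and potentially a new captured graph) per MMLU prompt.
# The extra pad positions still need to be masked out in attention.
BUCKETS = (256, 512, 1024, 2048)

def pad_to_bucket(tokens: torch.Tensor, pad_id: int) -> torch.Tensor:
    length = tokens.shape[-1]
    target = next((b for b in BUCKETS if b >= length), length)
    return F.pad(tokens, (0, target - length), value=pad_id)

# With bucketed inputs, dynamic=True can often be dropped:
# model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
```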
Many papers have recently addressed the quantization of activations for LLMs. Examples: https://github.com/ziplab/QLLM?tab=readme-ov-file#%F0%9F%9B%A0-install, https://github.com/mit-han-lab/lmquant?tab=readme-ov-file#efficiency-benchmarks, https://github.com/spcl/QuaRot. Is it possible to add activation quantization support to gpt-fast for even more...