gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
Hi maintainers @yanboliang @Chillee, I saw that Int8 weight-only quantization is enabled for Mixtral 8x7B, and the next step should be supporting int4 and int4-gptq. May I know the timeline...
…them and remapping their keys. Integrating loading, merging, and remapping into one step reduces the overall processing time by minimizing redundant operations. Pre-compiling the regular expression used for identifying and...
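As a rough illustration of that approach (not the repository's actual converter code), here is a minimal sketch that loads checkpoint shards, merges them, and remaps keys in a single pass, with the layer-index regular expression compiled once up front. `WEIGHT_MAP`, the key templates, and `load_merge_remap` are hypothetical names chosen for the example.

```python
import re
import torch

# Compile the layer-index pattern once instead of re-evaluating it for every key.
LAYER_RE = re.compile(r"model\.layers\.(\d+)\.")

# Illustrative mapping from HF-style key templates to gpt-fast-style names.
WEIGHT_MAP = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.layers.{}.self_attn.q_proj.weight": "layers.{}.attention.wq.weight",
    "model.layers.{}.mlp.gate_proj.weight": "layers.{}.feed_forward.w1.weight",
    "lm_head.weight": "output.weight",
}

def load_merge_remap(shard_paths):
    """Load all checkpoint shards, merge them, and remap keys in one pass."""
    merged = {}
    for path in shard_paths:
        shard = torch.load(path, map_location="cpu")
        for key, tensor in shard.items():
            match = LAYER_RE.search(key)
            if match:
                layer = match.group(1)
                template = LAYER_RE.sub("model.layers.{}.", key, count=1)
                new_key = WEIGHT_MAP.get(template)
                if new_key is None:
                    continue  # skip keys the target model does not use
                new_key = new_key.format(layer)
            else:
                new_key = WEIGHT_MAP.get(key, key)
            merged[new_key] = tensor
    return merged
```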
Hi! I tried to convert `princeton-nlp/Sheared-LLaMA-1.3B-ShareGPT` but it failed:
```
❯ ./scripts/prepare.sh $MODEL_REPO (gptfast)
README.md: 100%|██████████| 1.37k/1.37k [00:00
```
I can successfully run the scripts as explained in the repository, e.g. creating a quantized model and then running it with generate.py. However, the actual issue arises when I...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #102 Summary: att. Adding this for accuracy evaluation; we also added this in the executorch repo and we'll dedup later. Test Plan: quantization:...
Hi! How does ppl compare between fp16 and your int4?
I wrote a simple test to get the Triton code of `WeightOnlyInt8Linear`; the test code is as follows:
```
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightOnlyInt8Linear(torch.nn.Module):
    ...
```
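For reference, a self-contained sketch of such a test: the layer below follows the int8 weight-only pattern used in gpt-fast's quantize.py (an int8 `weight` buffer dequantized with per-channel `scales`), but the shapes and harness are illustrative assumptions, not the repository's exact code. Running the script with `TORCH_LOGS="output_code"` makes Inductor print the Triton kernels it generates for the compiled forward.

```python
# Run with: TORCH_LOGS="output_code" python test_int8_linear.py
# to print the Triton code Inductor generates for the compiled forward pass.
import torch
import torch.nn.functional as F

class WeightOnlyInt8Linear(torch.nn.Module):
    """Int8 weight-only linear: int8 weights dequantized with per-channel scales."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.register_buffer(
            "weight",
            torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8),
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cast the int8 weight to the activation dtype, then rescale the output.
        return F.linear(x, self.weight.to(dtype=x.dtype)) * self.scales

if __name__ == "__main__":
    device = "cuda"  # Inductor emits Triton kernels for GPU tensors
    mod = WeightOnlyInt8Linear(4096, 4096).to(device)
    x = torch.randn(1, 4096, dtype=torch.bfloat16, device=device)
    compiled = torch.compile(mod)
    with torch.no_grad():
        y = compiled(x)
    print(y.shape)
```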