gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
Hi maintainers @yanboliang @Chillee, I saw that Int8 weight-only quantization is enabled for Mixtral 8x7B, and the next step should be supporting int4 and int4-gptq. May I know the timeline...
…them and remapping their keys. Integrating loading, merging, and remapping into one step reduces the overall processing time by minimizing redundant operations. Pre-compiling the regular expression used for identifying and...
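As a rough illustration of that approach (not the repository's actual converter code), here is a minimal sketch that loads checkpoint shards, merges them, and remaps keys in a single pass, with the layer-index regular expression compiled once up front. `WEIGHT_MAP`, the key templates, and `load_merge_remap` are hypothetical names chosen for the example.

```python
import re
import torch

# Compile the layer-index pattern once instead of re-evaluating it for every key.
LAYER_RE = re.compile(r"model\.layers\.(\d+)\.")

# Illustrative mapping from HF-style key templates to gpt-fast-style names.
WEIGHT_MAP = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.layers.{}.self_attn.q_proj.weight": "layers.{}.attention.wq.weight",
    "model.layers.{}.mlp.gate_proj.weight": "layers.{}.feed_forward.w1.weight",
    "lm_head.weight": "output.weight",
}

def load_merge_remap(shard_paths):
    """Load all checkpoint shards, merge them, and remap keys in one pass."""
    merged = {}
    for path in shard_paths:
        shard = torch.load(path, map_location="cpu")
        for key, tensor in shard.items():
            match = LAYER_RE.search(key)
            if match:
                layer = match.group(1)
                template = LAYER_RE.sub("model.layers.{}.", key, count=1)
                new_key = WEIGHT_MAP.get(template)
                if new_key is None:
                    continue  # skip keys the target model does not use
                new_key = new_key.format(layer)
            else:
                new_key = WEIGHT_MAP.get(key, key)
            merged[new_key] = tensor
    return merged
```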
Hi! I tried to convert `princeton-nlp/Sheared-LLaMA-1.3B-ShareGPT` but it failed:
```
❯ ./scripts/prepare.sh $MODEL_REPO (gptfast)
README.md: 100%|██████████| 1.37k/1.37k [00:00
```
I can successfully run the scripts as explained in the repository, e.g. creating a quantized model and then running it with generate.py. However, the actual issue arises when I...
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #102 Summary: att. Adding this for accuracy evaluation; we also added this in the executorch repo and we'll dedup later. Test Plan: quantization:...
Hi! How does ppl compare between fp16 and your int4?
I wrote a simple test to get the Triton code of `WeightOnlyInt8Linear`; the test code is as follows:
```
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightOnlyInt8Linear(torch.nn.Module):
    ...
```
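For reference, a self-contained sketch of such a test: the layer below follows the int8 weight-only pattern used in gpt-fast's quantize.py (an int8 `weight` buffer dequantized with per-channel `scales`), but the shapes and harness are illustrative assumptions, not the repository's exact code. Running the script with `TORCH_LOGS="output_code"` makes Inductor print the Triton kernels it generates for the compiled forward.

```python
# Run with: TORCH_LOGS="output_code" python test_int8_linear.py
# to print the Triton code Inductor generates for the compiled forward pass.
import torch
import torch.nn.functional as F

class WeightOnlyInt8Linear(torch.nn.Module):
    """Int8 weight-only linear: int8 weights dequantized with per-channel scales."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.register_buffer(
            "weight",
            torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8),
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cast the int8 weight to the activation dtype, then rescale the output.
        return F.linear(x, self.weight.to(dtype=x.dtype)) * self.scales

if __name__ == "__main__":
    device = "cuda"  # Inductor emits Triton kernels for GPU tensors
    mod = WeightOnlyInt8Linear(4096, 4096).to(device)
    x = torch.randn(1, 4096, dtype=torch.bfloat16, device=device)
    compiled = torch.compile(mod)
    with torch.no_grad():
        y = compiled(x)
    print(y.shape)
```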