
GPTQ inference Triton kernel

9 GPTQ-triton issues

Is there a guide to learning how GPTQ works? I found the paper hard to follow... thanks!

This is a port of https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/221/files by @aljungberg to this repo. On my 4090, testing 30B model inference with 4-bit quantization, group-size 512, and true-sequential, I saw about an 8-10% speed-up for...

I've added two enhancements to the current GPTQ for LLaMA, which bring a speed-up. 1. Triton rotary embedding, implemented by [aljungberg](https://github.com/aljungberg) in https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/221, which implements rotary embedding with Triton. This gives a huge...
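For readers unfamiliar with the technique, here is a minimal sketch of a rotary-embedding kernel in Triton. It is illustrative only, not the PR's implementation: the kernel name, the GPT-NeoX/LLaMA "rotate-half" convention, float32 inputs, and a contiguous (seq_len, n_heads, head_dim) layout are all my assumptions.

```python
import math
import torch
import triton
import triton.language as tl

@triton.jit
def rope_kernel(x_ptr, out_ptr, n_heads, head_dim, inv, BLOCK: tl.constexpr):
    # One program instance per (position, head) pair.
    pid = tl.program_id(0)
    pos = pid // n_heads
    half = head_dim // 2
    i = tl.arange(0, BLOCK)                 # pair index within the head
    mask = i < half
    # freq_i = theta_base ** (-2*i / head_dim); inv = -2*log(theta)/head_dim
    freq = tl.exp(i.to(tl.float32) * inv)
    angle = pos.to(tl.float32) * freq
    c = tl.cos(angle)
    s = tl.sin(angle)
    base = pid * head_dim                   # contiguous (seq, head, dim) layout
    x1 = tl.load(x_ptr + base + i, mask=mask)         # first half of the head
    x2 = tl.load(x_ptr + base + half + i, mask=mask)  # second half
    tl.store(out_ptr + base + i, x1 * c - x2 * s, mask=mask)
    tl.store(out_ptr + base + half + i, x1 * s + x2 * c, mask=mask)

def rope(x: torch.Tensor, theta_base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, n_heads, head_dim) contiguous float32, head_dim even."""
    seq_len, n_heads, head_dim = x.shape
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(head_dim // 2)
    inv = -2.0 * math.log(theta_base) / head_dim  # precompute on host
    rope_kernel[(seq_len * n_heads,)](x, out, n_heads, head_dim, inv, BLOCK=BLOCK)
    return out
```

The win the PR reports comes from fusing the cos/sin computation and the rotation into one kernel launch instead of materializing intermediate tensors in PyTorch.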

> the weights are decoded using the formula `w = (w - z - 1) * s`. I wonder why we need to use z - 1 here since the...
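For what it's worth, the algebra works out if the checkpoint stores the zero point offset by one relative to the "true" affine zero point, so that `(q - z_stored - 1) * s` equals the usual `(q - z_true) * s`. The sketch below demonstrates that identity numerically; the storage convention is my assumption and is not verified against this repo's pack code.

```python
import numpy as np

bits = 4
rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)

s = (w.max() - w.min()) / (2**bits - 1)          # per-tensor scale
z_true = int(np.round(-w.min() / s))             # usual affine zero point
q = np.clip(np.round(w / s) + z_true, 0, 2**bits - 1)

z_stored = z_true - 1                            # assumed offset-by-one storage
w_hat = (q - z_stored - 1) * s                   # formula quoted in the issue
assert np.allclose(w_hat, (q - z_true) * s)      # algebraically identical
print("max reconstruction error:", np.abs(w - w_hat).max())
```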

When running the model--especially in a serverless environment where there may be many cold starts--it would be desirable to cache the auto-tuning results. Is this possible?
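Triton does persist compiled binaries on disk (the cache directory can be pointed somewhere durable via the `TRITON_CACHE_DIR` environment variable), but as far as I know the timing results selected by `@triton.autotune` live only in process memory, so each cold start re-benchmarks. One workaround is to maintain your own shape-to-config table instead of the decorator. A hypothetical sketch; `candidates` and `bench` are placeholders, not this repo's API:

```python
import json
import os

CACHE_FILE = os.path.expanduser("~/.gptq_triton_tuning.json")

def load_tuning():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_tuning(table):
    with open(CACHE_FILE, "w") as f:
        json.dump(table, f)

def pick_config(shape_key, candidates, bench):
    """Return the cached best config for this problem shape, tuning once if absent.

    candidates: list of JSON-serializable config dicts (e.g. block sizes).
    bench: callable(config) -> runtime in ms, run once at warmup.
    """
    table = load_tuning()
    if shape_key not in table:
        table[shape_key] = min(candidates, key=bench)
        save_tuning(table)
    return table[shape_key]
```

The tuning file could then be baked into the serverless image so cold starts skip benchmarking entirely.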

CUDA: 35 tokens/s; Triton: 5 tokens/s. I used ooba's webui only for CUDA, because I've been unable to get Triton to work with ooba's webui. I made sure I used the same...

Hi, really good work; I appreciate it a lot. I am curious whether Triton can support 1-bit acceleration for MMA, and further, whether it could be applied to 1-bit GPTQ?
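Whether Triton's `tl.dot` can target the hardware's 1-bit Tensor Core MMA is a separate question, but the arithmetic that 1-bit MMA performs is simple to state: with values in {-1, +1} encoded as single bits, a dot product becomes XNOR plus popcount. A minimal sketch of that identity (illustrative only):

```python
import numpy as np

# Encode 1 -> +1 and 0 -> -1. For a length-K dot product:
#   a . b = 2 * popcount(xnor(a_bits, b_bits)) - K
K = 32
rng = np.random.default_rng(0)
a_bits = rng.integers(0, 2, K, dtype=np.uint8)
b_bits = rng.integers(0, 2, K, dtype=np.uint8)

a = 2 * a_bits.astype(np.int32) - 1     # decode bits to +/-1
b = 2 * b_bits.astype(np.int32) - 1
reference = int(a @ b)

xnor = 1 - (a_bits ^ b_bits)            # 1 where the bits agree
fast = 2 * int(xnor.sum()) - K          # popcount form of the dot product
assert fast == reference
```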

Thanks, I wanted to try your Triton version, but I only have 8 GB of RAM. The GPTQ CUDA version works (7B model). Your version (the ppl script) crashes with CUDA...

I'm trying to load the 7B quantized model (which I quantized using the script in this repository) on an **NVIDIA TITAN Xp**, but I get the following errors. This one with...