
GPTQ inference Triton kernel

9 GPTQ-triton issues

Is there a guide to learning how GPTQ works? I found the paper hard to follow... thanks!

This is a port of https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/221/files by @aljungberg to this repo. On my 4090, testing 30B model inference with 4-bit quantization, group-size 512, and true-sequential, I saw about an 8-10% speed-up for...

I've added two enhancements to the current GPTQ for LLaMA, which bring a speed-up. 1. Triton rotary embedding, implemented by [aljungberg](https://github.com/aljungberg) in https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/221, which implements rotary embedding with Triton. This gives a huge...
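For readers unfamiliar with the technique, here is a minimal sketch of a rotary-embedding kernel in Triton. It is illustrative only, not the PR's implementation: the kernel name, the GPT-NeoX/LLaMA "rotate-half" convention, float32 inputs, and a contiguous (seq_len, n_heads, head_dim) layout are all my assumptions.

```python
import math
import torch
import triton
import triton.language as tl

@triton.jit
def rope_kernel(x_ptr, out_ptr, n_heads, head_dim, inv, BLOCK: tl.constexpr):
    # One program instance per (position, head) pair.
    pid = tl.program_id(0)
    pos = pid // n_heads
    half = head_dim // 2
    i = tl.arange(0, BLOCK)                 # pair index within the head
    mask = i < half
    # freq_i = theta_base ** (-2*i / head_dim); inv = -2*log(theta)/head_dim
    freq = tl.exp(i.to(tl.float32) * inv)
    angle = pos.to(tl.float32) * freq
    c = tl.cos(angle)
    s = tl.sin(angle)
    base = pid * head_dim                   # contiguous (seq, head, dim) layout
    x1 = tl.load(x_ptr + base + i, mask=mask)         # first half of the head
    x2 = tl.load(x_ptr + base + half + i, mask=mask)  # second half
    tl.store(out_ptr + base + i, x1 * c - x2 * s, mask=mask)
    tl.store(out_ptr + base + half + i, x1 * s + x2 * c, mask=mask)

def rope(x: torch.Tensor, theta_base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, n_heads, head_dim) contiguous float32, head_dim even."""
    seq_len, n_heads, head_dim = x.shape
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(head_dim // 2)
    inv = -2.0 * math.log(theta_base) / head_dim  # precompute on host
    rope_kernel[(seq_len * n_heads,)](x, out, n_heads, head_dim, inv, BLOCK=BLOCK)
    return out
```

The win the PR reports comes from fusing the cos/sin computation and the rotation into one kernel launch instead of materializing intermediate tensors in PyTorch.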

> the weights are decoded using the formula `w = (w - z - 1) * s`. I wonder why we need to use z - 1 here since the...
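For what it's worth, the algebra works out if the checkpoint stores the zero point offset by one relative to the "true" affine zero point, so that `(q - z_stored - 1) * s` equals the usual `(q - z_true) * s`. The sketch below demonstrates that identity numerically; the storage convention is my assumption and is not verified against this repo's pack code.

```python
import numpy as np

bits = 4
rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)

s = (w.max() - w.min()) / (2**bits - 1)          # per-tensor scale
z_true = int(np.round(-w.min() / s))             # usual affine zero point
q = np.clip(np.round(w / s) + z_true, 0, 2**bits - 1)

z_stored = z_true - 1                            # assumed offset-by-one storage
w_hat = (q - z_stored - 1) * s                   # formula quoted in the issue
assert np.allclose(w_hat, (q - z_true) * s)      # algebraically identical
print("max reconstruction error:", np.abs(w - w_hat).max())
```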

When running the model--especially in a serverless environment where there may be many cold starts--it would be desirable to cache the auto-tuning results. Is this possible?
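Triton does persist compiled binaries on disk (the cache directory can be pointed somewhere durable via the `TRITON_CACHE_DIR` environment variable), but as far as I know the timing results selected by `@triton.autotune` live only in process memory, so each cold start re-benchmarks. One workaround is to maintain your own shape-to-config table instead of the decorator. A hypothetical sketch; `candidates` and `bench` are placeholders, not this repo's API:

```python
import json
import os

CACHE_FILE = os.path.expanduser("~/.gptq_triton_tuning.json")

def load_tuning():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_tuning(table):
    with open(CACHE_FILE, "w") as f:
        json.dump(table, f)

def pick_config(shape_key, candidates, bench):
    """Return the cached best config for this problem shape, tuning once if absent.

    candidates: list of JSON-serializable config dicts (e.g. block sizes).
    bench: callable(config) -> runtime in ms, run once at warmup.
    """
    table = load_tuning()
    if shape_key not in table:
        table[shape_key] = min(candidates, key=bench)
        save_tuning(table)
    return table[shape_key]
```

The tuning file could then be baked into the serverless image so cold starts skip benchmarking entirely.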

CUDA: 35 tokens/s; Triton: 5 tokens/s. I used ooba's webui only for CUDA, because I've been unable to get Triton to work with ooba's webui. I made sure I used the same...

Hi, really good work; I appreciate it a lot. I am curious whether Triton can support 1-bit acceleration for MMA, and further, whether it could be applied to 1-bit GPTQ?
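Whether Triton's `tl.dot` can target the hardware's 1-bit Tensor Core MMA is a separate question, but the arithmetic that 1-bit MMA performs is simple to state: with values in {-1, +1} encoded as single bits, a dot product becomes XNOR plus popcount. A minimal sketch of that identity (illustrative only):

```python
import numpy as np

# Encode 1 -> +1 and 0 -> -1. For a length-K dot product:
#   a . b = 2 * popcount(xnor(a_bits, b_bits)) - K
K = 32
rng = np.random.default_rng(0)
a_bits = rng.integers(0, 2, K, dtype=np.uint8)
b_bits = rng.integers(0, 2, K, dtype=np.uint8)

a = 2 * a_bits.astype(np.int32) - 1     # decode bits to +/-1
b = 2 * b_bits.astype(np.int32) - 1
reference = int(a @ b)

xnor = 1 - (a_bits ^ b_bits)            # 1 where the bits agree
fast = 2 * int(xnor.sum()) - K          # popcount form of the dot product
assert fast == reference
```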

Thanks, I wanted to try your Triton version, but I only have 8 GB of RAM. The GPTQ CUDA version works (7B model). Your version (the ppl script) crashes with CUDA...

I'm trying to load the 7B quantized model (which I quantized using the script in this repository) on an **NVIDIA TITAN Xp**, but I get the following errors. This one with...