ABQ-LLM

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

13 ABQ-LLM issues

Great job, starred! I do have a few questions: 1. Did you test the e2e generation speed, specifically in terms of tokens/second or the latency of the first token? 2....
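Since this question concerns end-to-end speed, a minimal measurement sketch may be useful. It is not from the ABQ-LLM repo; `model_path` is a placeholder, and it shows one common way to time first-token latency and decode throughput for a Hugging Face causal LM:

```python
# Hypothetical timing sketch (not part of ABQ-LLM): first-token latency
# and steady-state decode throughput for a Hugging Face causal LM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda"
)
inputs = tok("Explain quantization in one sentence.", return_tensors="pt").to("cuda")

# First-token latency: generate exactly one new token.
torch.cuda.synchronize()
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, min_new_tokens=1, do_sample=False)
torch.cuda.synchronize()
print(f"first-token latency: {time.perf_counter() - t0:.3f} s")

# Decode throughput: amortize over a longer generation.
n_new = 128
torch.cuda.synchronize()
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=n_new, min_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
print(f"throughput: {n_new / (time.perf_counter() - t0):.1f} tokens/s")
```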

I noticed that the seqlen dimension (variable M in engine/test.sh) in the kernel benchmark is very small. Does this mean that the test only considers the decode stage and ignores...
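For context on why M matters: during decode, each step feeds one new token per sequence into the linear layers, so the GEMM's M dimension is only the batch size, while prefill processes the whole prompt at once, giving M = batch × seqlen. An illustrative sketch with arbitrary shapes:

```python
# Illustrative only: GEMM shapes in prefill vs. decode.
import torch

batch, seqlen, hidden = 8, 2048, 4096
W = torch.randn(hidden, hidden)

# Prefill: all prompt tokens at once -> M = batch * seqlen = 16384 (large).
x_prefill = torch.randn(batch * seqlen, hidden)
y_prefill = x_prefill @ W

# Decode: one new token per sequence per step -> M = batch = 8 (small).
x_decode = torch.randn(batch, hidden)
y_decode = x_decode @ W
print(y_prefill.shape, y_decode.shape)
```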

I used this command to quantize the llama2-7b-chat model, but the model size doesn't change. CUDA_VISIBLE_DEVICES=0 python3 main.py \ --model /mnt/home/model/llama2-7b-chat-hf \ --epochs 20 --output_dir ./log/llama2-7b-w2a8 \ --eval_ppl --wbits 2 --abits...
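A plausible explanation, offered as an assumption rather than a confirmed answer: pipelines like this typically perform simulated (fake) quantization during optimization, so the saved weights remain FP16 and the checkpoint size does not shrink until the weights are actually packed to low bits. For readability, a cleaned-up sketch of the truncated command above; the --abits value is a guess, inferred only from the "w2a8" output directory name:

```bash
# Cleaned-up sketch of the command above; "--abits 8" is hypothetical
# because the original snippet is truncated.
CUDA_VISIBLE_DEVICES=0 python3 main.py \
    --model /mnt/home/model/llama2-7b-chat-hf \
    --epochs 20 \
    --output_dir ./log/llama2-7b-w2a8 \
    --eval_ppl \
    --wbits 2 \
    --abits 8
```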

As mentioned in the README, [Note that due to the limitations of AutoGPTQ kernels, the real quantization of weight-only quantization can only lead to memory reduction, but with slower inference speed.] I'm...
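The memory-reduction part of that note is easy to sanity-check with rough arithmetic. The sketch below counts weight bytes only, ignoring quantization metadata (scales, zero points) and activation memory:

```python
# Rough weight-memory arithmetic for a 7B-parameter model; ignores
# quantization metadata (scales, zero points) and activation memory.
params = 7e9
for bits in (16, 8, 4, 2):
    gib = params * bits / 8 / 2**30
    print(f"{bits:2d}-bit weights: ~{gib:.1f} GiB")
# -> 16-bit ~13.0 GiB, 8-bit ~6.5 GiB, 4-bit ~3.3 GiB, 2-bit ~1.6 GiB
```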

Hi, if I have a linear layer whose weights only take values in {0, 1, -1}, is it possible to use your kernel for weight compression and inference speed-up?...
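For illustration only (this is not the ABQ-LLM kernel's actual storage layout): ternary values {-1, 0, 1} need 2 bits each, so four weights pack into one byte, an 8x reduction over FP16:

```python
# Illustrative 2-bit packing for ternary weights {-1, 0, 1};
# not the ABQ-LLM kernel's actual layout.
import numpy as np

w = np.array([-1, 0, 1, 1, 0, -1, 1, 0], dtype=np.int8)
codes = (w + 1).astype(np.uint8).reshape(-1, 4)   # map {-1,0,1} -> {0,1,2}
packed = (codes[:, 0]
          | (codes[:, 1] << 2)
          | (codes[:, 2] << 4)
          | (codes[:, 3] << 6)).astype(np.uint8)  # 4 weights per byte

# Unpack and verify the round-trip.
unpacked = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
restored = unpacked.reshape(-1).astype(np.int8) - 1
assert np.array_equal(restored, w)
```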

Hi! Where is the pre-trained ABQ-LLM model zoo?

For chat models, the calibration dataset's input_ids and attn_masks should be passed in.
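A minimal sketch of what this issue suggests, assuming a recent transformers version with chat-template support; the model path is a placeholder:

```python
# Sketch: build calibration inputs for a chat model so that input_ids
# and attention_mask reflect the chat template (path is a placeholder).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/llama2-7b-chat-hf")
messages = [{"role": "user", "content": "Summarize quantization in one line."}]

enc = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # returns both input_ids and attention_mask
)
input_ids, attn_mask = enc["input_ids"], enc["attention_mask"]
# Pass both tensors into the calibration loop instead of raw text.
```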

I built the wheel by following these steps:
```
cd algorithm
python setup.py build
```
but I got the above errors with **CUDA 12.1** and the conda Python env **abq-llm**. Can the repo provide...

Thanks for your great work! I want to know how to reproduce the end-to-end throughput experiments (i.e., e2e_speed.png). Can you provide the complete code integrated into...