ABQ-LLM

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

13 ABQ-LLM issues

Great job, starred! I do have a few questions: 1. Did you test the e2e generation speed, specifically in terms of tokens/second or the latency of the first token? 2....
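Since this question concerns end-to-end speed, a minimal measurement sketch may be useful. It is not from the ABQ-LLM repo; `model_path` is a placeholder, and it shows one common way to time first-token latency and decode throughput for a Hugging Face causal LM:

```python
# Hypothetical timing sketch (not part of ABQ-LLM): first-token latency
# and steady-state decode throughput for a Hugging Face causal LM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda"
)
inputs = tok("Explain quantization in one sentence.", return_tensors="pt").to("cuda")

# First-token latency: generate exactly one new token.
torch.cuda.synchronize()
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, min_new_tokens=1, do_sample=False)
torch.cuda.synchronize()
print(f"first-token latency: {time.perf_counter() - t0:.3f} s")

# Decode throughput: amortize over a longer generation.
n_new = 128
torch.cuda.synchronize()
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=n_new, min_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
print(f"throughput: {n_new / (time.perf_counter() - t0):.1f} tokens/s")
```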

I noticed that the seqlen dimension (variable M in engine/test.sh) in the kernel benchmark is very small. Does this mean that the test only considers the decode stage and ignores...
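For context on why M matters: during decode, each step feeds one new token per sequence into the linear layers, so the GEMM's M dimension is only the batch size, while prefill processes the whole prompt at once, giving M = batch × seqlen. An illustrative sketch with arbitrary shapes:

```python
# Illustrative only: GEMM shapes in prefill vs. decode.
import torch

batch, seqlen, hidden = 8, 2048, 4096
W = torch.randn(hidden, hidden)

# Prefill: all prompt tokens at once -> M = batch * seqlen = 16384 (large).
x_prefill = torch.randn(batch * seqlen, hidden)
y_prefill = x_prefill @ W

# Decode: one new token per sequence per step -> M = batch = 8 (small).
x_decode = torch.randn(batch, hidden)
y_decode = x_decode @ W
print(y_prefill.shape, y_decode.shape)
```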

I used this command to quantize the llama2-7b-chat model, but the model size doesn't change. CUDA_VISIBLE_DEVICES=0 python3 main.py \ --model /mnt/home/model/llama2-7b-chat-hf \ --epochs 20 --output_dir ./log/llama2-7b-w2a8 \ --eval_ppl --wbits 2 --abits...
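A plausible explanation, offered as an assumption rather than a confirmed answer: pipelines like this typically perform simulated (fake) quantization during optimization, so the saved weights remain FP16 and the checkpoint size does not shrink until the weights are actually packed to low bits. For readability, a cleaned-up sketch of the truncated command above; the --abits value is a guess, inferred only from the "w2a8" output directory name:

```bash
# Cleaned-up sketch of the command above; "--abits 8" is hypothetical
# because the original snippet is truncated.
CUDA_VISIBLE_DEVICES=0 python3 main.py \
    --model /mnt/home/model/llama2-7b-chat-hf \
    --epochs 20 \
    --output_dir ./log/llama2-7b-w2a8 \
    --eval_ppl \
    --wbits 2 \
    --abits 8
```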

As mentioned in the README, [Note that due to the limitations of AutoGPTQ kernels, the real quantization of weight-only quantization can only lead to memory reduction, but with slower inference speed.] I'm...
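The memory-reduction part of that note is easy to sanity-check with rough arithmetic. The sketch below counts weight bytes only, ignoring quantization metadata (scales, zero points) and activation memory:

```python
# Rough weight-memory arithmetic for a 7B-parameter model; ignores
# quantization metadata (scales, zero points) and activation memory.
params = 7e9
for bits in (16, 8, 4, 2):
    gib = params * bits / 8 / 2**30
    print(f"{bits:2d}-bit weights: ~{gib:.1f} GiB")
# -> 16-bit ~13.0 GiB, 8-bit ~6.5 GiB, 4-bit ~3.3 GiB, 2-bit ~1.6 GiB
```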

Hi, if I have a linear layer whose weights only take values in {0, 1, -1}, is it possible to use your kernel for weight compression and inference speed-up?...
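For illustration only (this is not the ABQ-LLM kernel's actual storage layout): ternary values {-1, 0, 1} need 2 bits each, so four weights pack into one byte, an 8x reduction over FP16:

```python
# Illustrative 2-bit packing for ternary weights {-1, 0, 1};
# not the ABQ-LLM kernel's actual layout.
import numpy as np

w = np.array([-1, 0, 1, 1, 0, -1, 1, 0], dtype=np.int8)
codes = (w + 1).astype(np.uint8).reshape(-1, 4)   # map {-1,0,1} -> {0,1,2}
packed = (codes[:, 0]
          | (codes[:, 1] << 2)
          | (codes[:, 2] << 4)
          | (codes[:, 3] << 6)).astype(np.uint8)  # 4 weights per byte

# Unpack and verify the round-trip.
unpacked = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
restored = unpacked.reshape(-1).astype(np.int8) - 1
assert np.array_equal(restored, w)
```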

Hi! Where is the pre-trained ABQ-LLM model zoo?

For chat models, the calibration dataset's input_ids and attn_masks should be passed in.
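A minimal sketch of what this issue suggests, assuming a recent transformers version with chat-template support; the model path is a placeholder:

```python
# Sketch: build calibration inputs for a chat model so that input_ids
# and attention_mask reflect the chat template (path is a placeholder).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/llama2-7b-chat-hf")
messages = [{"role": "user", "content": "Summarize quantization in one line."}]

enc = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # returns both input_ids and attention_mask
)
input_ids, attn_mask = enc["input_ids"], enc["attention_mask"]
# Pass both tensors into the calibration loop instead of raw text.
```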

I built the wheel by following these steps:
```
cd algorithm
python setup.py build
```
but I got the above errors with **CUDA 12.1** and the conda Python env **abq-llm**. Can the repo provide...

Thanks for your great work! I want to know how to reproduce the end-to-end throughput experiments (i.e., e2e_speed.png). Can you provide the complete code integrated into...