zengchao0424

Results 7 comments of zengchao0424

> hi! where is the pre-trained ABQ-LLM model zoo?

Hello, we will release the quantized models mentioned in the paper, pending internal company approval. We will share the HF link...

If you want to customize your own calibration dataset, you can add support for it in the `get_loader` function inside `ABQ-LLM/algorithm/datautils.py`. For example, you can implement the `get_chat` function to...
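A minimal sketch of what such a loader could look like (the dataset path, the `"text"` field name, and the `get_chat` signature below are assumptions for illustration, not the exact ABQ-LLM code):

```python
import random
from datasets import load_dataset
from transformers import AutoTokenizer

def get_chat(nsamples, seed, seqlen, model):
    """Hypothetical loader for a custom chat-style calibration dataset."""
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    # Replace the path and field name with your own calibration data.
    data = load_dataset("json", data_files="my_chat_data.json", split="train")
    enc = tokenizer("\n\n".join(data["text"]), return_tensors="pt")

    random.seed(seed)
    trainloader = []
    for _ in range(nsamples):
        # Sample a random seqlen-long window from the tokenized corpus.
        i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
        inp = enc.input_ids[:, i : i + seqlen]
        tar = inp.clone()
        tar[:, :-1] = -100  # mask all but the last token in the target
        trainloader.append((inp, tar))
    return trainloader, enc

# Then dispatch to it from get_loader, e.g.:
# if "chat" in name:
#     return get_chat(nsamples, seed, seqlen, model)
```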

Hello, the weights obtained this way are the calibrated fake-quantized weights. To achieve actual weight compression, a packing operation is required to store the weights. For example, using...
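For illustration, a minimal packing sketch (not ABQ-LLM's actual packing kernel) that stores two 4-bit quantized values per int8 byte, cutting storage from 8 to 4 bits per weight:

```python
import torch

def pack_int4(qweight: torch.Tensor) -> torch.Tensor:
    """qweight: integer tensor with values in [0, 15], last dim must be even."""
    assert qweight.shape[-1] % 2 == 0
    q = qweight.to(torch.uint8)
    low = q[..., 0::2]           # even columns -> low nibble
    high = q[..., 1::2]          # odd columns  -> high nibble
    return (high << 4) | low     # two 4-bit weights per byte

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the original [0, 15] values."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    out = torch.stack((low, high), dim=-1)
    return out.reshape(*packed.shape[:-1], -1)
```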

Thanks for your attention to our work. Matrix multiplication between int and float operands is not supported, but based on our experience in model optimization, the effect of int16 and float16...

Hello, Qwen2 implements its attention computation with GQA. We have added GQA support in our implementation, so our LLaMA code path can also handle GQA models such as LLaMA-3. The model...
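As a rough illustration (a sketch of the standard GQA expansion, not our exact kernel): the key/value heads are repeated so they match the number of query heads before the attention matmul.

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """x: (batch, num_kv_heads, seq_len, head_dim)
    -> (batch, num_kv_heads * n_rep, seq_len, head_dim)."""
    if n_rep == 1:
        return x
    b, h_kv, s, d = x.shape
    x = x[:, :, None, :, :].expand(b, h_kv, n_rep, s, d)
    return x.reshape(b, h_kv * n_rep, s, d)

# e.g. a model with 32 query heads sharing 8 KV heads -> n_rep = 4
k = torch.randn(1, 8, 128, 64)
k_expanded = repeat_kv(k, n_rep=32 // 8)   # -> (1, 32, 128, 64)
```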

In practice, W4A4 low-bit quantization algorithms optimized under the "smooth paradigm" often encounter performance bottlenecks on complex evaluation tasks, particularly when applied to cutting-edge models such as LLaMA-3.1. These state-of-the-art...
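For background, a minimal sketch of the SmoothQuant-style smoothing step that the "smooth paradigm" refers to (the alpha value and the folding point are assumptions for illustration): activation outliers are migrated into the weights via a per-input-channel scale, so that `(X / s) @ (W * s).T == X @ W.T` is preserved.

```python
import torch

def smooth_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """act_absmax: per-input-channel activation max |x|; weight: (out, in)."""
    w_absmax = weight.abs().amax(dim=0)                            # per input channel
    s = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    return s

# Fold the scales: divide the activations, multiply the weight columns.
act_absmax = torch.rand(4096) * 10
W = torch.randn(4096, 4096)
s = smooth_scales(act_absmax, W)
W_smoothed = W * s          # scale each input channel of the weight
# at runtime: X_smoothed = X / s (often folded into the preceding LayerNorm)
```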

Hi, for the W4A16 configuration, most mainstream quantization algorithms can achieve good results. If you use a per-channel setting for W4A16, you can choose whichever quantization algorithm you prefer. However, in...
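As a reference point, a minimal sketch of per-channel 4-bit weight fake quantization (not ABQ-LLM's exact quantizer): each output channel gets its own scale and zero-point.

```python
import torch

def quantize_w4_per_channel(w: torch.Tensor, n_bits: int = 4):
    """w: (out_features, in_features) weight matrix."""
    qmax = 2 ** n_bits - 1
    w_min = w.amin(dim=1, keepdim=True)            # per output channel
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    w_deq = (q - zero) * scale                     # fake-quantized weights (still fp)
    return q.to(torch.uint8), scale, zero, w_deq

q, scale, zero, w_fq = quantize_w4_per_channel(torch.randn(4096, 4096))
```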