Casper
> Thank you.
>
> Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) while using quantization (unpacking quantized data...
The amount of work scales linearly with batch size. The problem arises when you increase the batch size too much, because then your GPU is 100% utilized just doing matrix multiplication. Once that...
This is the difference between memory-bound and compute-bound execution. At small batch sizes, you are memory-bound, meaning you are limited by how fast you can stream the model's weights...
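To make that concrete, here is a rough back-of-the-envelope sketch of where the crossover between the two regimes sits. The hardware numbers are illustrative assumptions (roughly A100-class), not measurements:

```python
# Back-of-the-envelope roofline: estimate the batch size where decoding
# flips from memory-bound to compute-bound. All hardware numbers are
# illustrative assumptions -- substitute your own GPU's specs.

PEAK_FLOPS = 312e12   # assumed peak FP16 throughput, FLOP/s
PEAK_BW = 2.0e12      # assumed HBM bandwidth, bytes/s

# Per decoding step, a dense matmul over the whole model costs about
# 2 * n_params FLOPs per token, while the weights must be streamed from
# HBM once regardless of batch size. Setting
#   batch * 2 * n_params / PEAK_FLOPS == n_params * bytes / PEAK_BW
# gives the crossover batch size:
def crossover_batch(bytes_per_weight: float) -> float:
    return PEAK_FLOPS / PEAK_BW * bytes_per_weight / 2

print(f"FP16 (2 bytes/weight):   ~{crossover_batch(2.0):.0f}")
print(f"INT4 (0.5 bytes/weight): ~{crossover_batch(0.5):.0f}")
```

Note that fewer bytes per weight move the crossover to a smaller batch size, which is why a quantized model saturates compute sooner, and the dequantization work inside the kernel only adds to that.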
Weights are loaded at startup time and are then moved through registers at inference time. This process is not very transparent, but it all happens in the quantization kernel, which you can find...
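For readers who want to see what that step amounts to, here is a minimal NumPy sketch of the unpack-and-dequantize work such a kernel does per weight. Real kernels do this in GPU registers, and the packing layout here (eight nibbles little-end-first in an int32) is an assumption; actual kernels often interleave nibbles for memory coalescing:

```python
import numpy as np

# Minimal sketch of the unpack + dequantize step a 4-bit kernel performs.
# Layout assumption: eight 4-bit values packed little-end-first into one
# int32 word; real kernels often use a different interleave.

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack int32 words into unsigned 4-bit values (8 per word)."""
    shifts = np.arange(8, dtype=np.uint32) * 4             # 0, 4, ..., 28
    nibbles = (packed[..., None].astype(np.uint32) >> shifts) & 0xF
    return nibbles.reshape(*packed.shape[:-1], -1)

def dequantize(packed, scales, zeros):
    """w = (q - zero) * scale; per-group handling simplified to broadcast."""
    q = unpack_int4(packed).astype(np.float32)
    return (q - zeros) * scales

packed = np.array([[0x76543210]], dtype=np.int32)          # toy packed word
print(unpack_int4(packed))                                 # [[0 1 2 3 4 5 6 7]]
```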
Hi @ArthurZucker, yes, this is one of the issues. I have released 0.2.4, which pins transformers.
Thanks @ArthurZucker, I appreciate the collaboration here to make the best of quantized models. At present, I will not be able to provide support for quantizing newer models (e.g. QWen2MoE)...
@chu-tianxiang Great job on optimizing GPTQ! Is there any option other than repacking for AWQ?
> @chu-tianxiang Great job on optimizing GPTQ! Is there any option other than repacking for AWQ?

I can implement the AWQ kernel based on the current AWQ GEMM implementation...
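For what it's worth, the repacking alternative is usually a one-time offline transform: unpack the source layout, permute the nibbles into the order the target kernel expects, and pack them back. A rough NumPy sketch of the idea; the `src_order`/`dst_order` permutations are placeholders, since the real interleave patterns are kernel-specific:

```python
import numpy as np

# One-time repacking sketch: unpack 4-bit values from a source layout,
# permute them into the target kernel's expected order, and repack.
# `src_order`/`dst_order` are placeholder permutations of 0..7 -- the
# actual interleave patterns depend on the GEMM kernels involved.

def repack_int4(packed: np.ndarray, src_order, dst_order) -> np.ndarray:
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (packed[..., None].astype(np.uint32) >> shifts) & 0xF
    logical = np.empty_like(nibbles)
    logical[..., src_order] = nibbles        # undo the source interleave
    reordered = logical[..., dst_order]      # apply the target interleave
    packed_out = np.bitwise_or.reduce(reordered << shifts, axis=-1)
    return packed_out.view(np.int32)

w = np.array([0x76543210], dtype=np.int32)
identity = list(range(8))
assert (repack_int4(w, identity, identity) == w).all()   # round trip
```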
This is excellent work! Looking forward to seeing this merged for a big speedup.
@chu-tianxiang On a side note, I tried importing the kernels from here into AutoAWQ, and I am getting a CUDA illegal memory access on multi-GPU, while it works fine on a...
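In case it helps narrow this down: one frequent cause of exactly this symptom (an assumption on my side, not a confirmed diagnosis) is the extension launching on the current CUDA device instead of the device that owns the tensors, which only surfaces once layers are spread across GPUs. A device guard on the Python side is a cheap thing to try; `awq_ext.gemm_forward` below is a hypothetical stand-in for the real entry point:

```python
import torch
import awq_ext  # hypothetical compiled extension module

def gemm_on_right_device(x: torch.Tensor, qweight, scales, zeros):
    # Pin the CUDA context to the device that owns the inputs before
    # launching, so the kernel does not dereference pointers belonging
    # to another GPU (a classic source of illegal memory accesses).
    with torch.cuda.device(x.device):
        return awq_ext.gemm_forward(x, qweight, scales, zeros)
```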