Casper
> Thank you.
>
> Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) while using quantization (unpacking quantized data...
The amount of work scales linearly with batch size. The problem arises when you increase the batch size too much, because then your GPU is 100% utilized just doing matrix multiplication. Once that...
This is the difference between memory-bound and compute-bound execution. At small batch sizes, you are memory-bound, meaning you are limited by how fast you can stream the model's weights...
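To make that concrete, here is a rough back-of-the-envelope sketch of where the crossover between the two regimes sits. The hardware numbers are illustrative assumptions (roughly A100-class), not measurements:

```python
# Back-of-the-envelope roofline: estimate the batch size where decoding
# flips from memory-bound to compute-bound. All hardware numbers are
# illustrative assumptions -- substitute your own GPU's specs.

PEAK_FLOPS = 312e12   # assumed peak FP16 throughput, FLOP/s
PEAK_BW = 2.0e12      # assumed HBM bandwidth, bytes/s

# Per decoding step, a dense matmul over the whole model costs about
# 2 * n_params FLOPs per token, while the weights must be streamed from
# HBM once regardless of batch size. Setting
#   batch * 2 * n_params / PEAK_FLOPS == n_params * bytes / PEAK_BW
# gives the crossover batch size:
def crossover_batch(bytes_per_weight: float) -> float:
    return PEAK_FLOPS / PEAK_BW * bytes_per_weight / 2

print(f"FP16 (2 bytes/weight):   ~{crossover_batch(2.0):.0f}")
print(f"INT4 (0.5 bytes/weight): ~{crossover_batch(0.5):.0f}")
```

Note that fewer bytes per weight move the crossover to a smaller batch size, which is why a quantized model saturates compute sooner, and the dequantization work inside the kernel only adds to that.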
Weights are loaded at startup time and are then moved through registers at inference time. This process is not very transparent, but it all happens in the quantization kernel, which you can find...
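For readers who want to see what that step amounts to, here is a minimal NumPy sketch of the unpack-and-dequantize work such a kernel does per weight. Real kernels do this in GPU registers, and the packing layout here (eight nibbles little-end-first in an int32) is an assumption; actual kernels often interleave nibbles for memory coalescing:

```python
import numpy as np

# Minimal sketch of the unpack + dequantize step a 4-bit kernel performs.
# Layout assumption: eight 4-bit values packed little-end-first into one
# int32 word; real kernels often use a different interleave.

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack int32 words into unsigned 4-bit values (8 per word)."""
    shifts = np.arange(8, dtype=np.uint32) * 4             # 0, 4, ..., 28
    nibbles = (packed[..., None].astype(np.uint32) >> shifts) & 0xF
    return nibbles.reshape(*packed.shape[:-1], -1)

def dequantize(packed, scales, zeros):
    """w = (q - zero) * scale; per-group handling simplified to broadcast."""
    q = unpack_int4(packed).astype(np.float32)
    return (q - zeros) * scales

packed = np.array([[0x76543210]], dtype=np.int32)          # toy packed word
print(unpack_int4(packed))                                 # [[0 1 2 3 4 5 6 7]]
```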
Hi @ArthurZucker, yes, this is one of the issues. I have released 0.2.4, which pins transformers.
Thanks @ArthurZucker, I appreciate the collaboration here to make the best of quantized models. At present, I will not be able to provide support for quantizing newer models (e.g. QWen2MoE)...
@chu-tianxiang Great job on optimizing GPTQ! Is there any option other than repacking for AWQ?
> @chu-tianxiang Great job on optimizing GPTQ! Is there any option other than repacking for AWQ?

I can implement the AWQ kernel based on the current AWQ GEMM implementation...
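For what it's worth, the repacking alternative is usually a one-time offline transform: unpack the source layout, permute the nibbles into the order the target kernel expects, and pack them back. A rough NumPy sketch of the idea; the `src_order`/`dst_order` permutations are placeholders, since the real interleave patterns are kernel-specific:

```python
import numpy as np

# One-time repacking sketch: unpack 4-bit values from a source layout,
# permute them into the target kernel's expected order, and repack.
# `src_order`/`dst_order` are placeholder permutations of 0..7 -- the
# actual interleave patterns depend on the GEMM kernels involved.

def repack_int4(packed: np.ndarray, src_order, dst_order) -> np.ndarray:
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (packed[..., None].astype(np.uint32) >> shifts) & 0xF
    logical = np.empty_like(nibbles)
    logical[..., src_order] = nibbles        # undo the source interleave
    reordered = logical[..., dst_order]      # apply the target interleave
    packed_out = np.bitwise_or.reduce(reordered << shifts, axis=-1)
    return packed_out.view(np.int32)

w = np.array([0x76543210], dtype=np.int32)
identity = list(range(8))
assert (repack_int4(w, identity, identity) == w).all()   # round trip
```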
This is excellent work! Looking forward to seeing this merged for a big speedup.
@chu-tianxiang On a side note, I tried importing the kernels from here into AutoAWQ, and I am getting a CUDA illegal memory access on multi-GPU, while it works fine on a...
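In case it helps narrow this down: one frequent cause of exactly this symptom (an assumption on my side, not a confirmed diagnosis) is the extension launching on the current CUDA device instead of the device that owns the tensors, which only surfaces once layers are spread across GPUs. A device guard on the Python side is a cheap thing to try; `awq_ext.gemm_forward` below is a hypothetical stand-in for the real entry point:

```python
import torch
import awq_ext  # hypothetical compiled extension module

def gemm_on_right_device(x: torch.Tensor, qweight, scales, zeros):
    # Pin the CUDA context to the device that owns the inputs before
    # launching, so the kernel does not dereference pointers belonging
    # to another GPU (a classic source of illegal memory accesses).
    with torch.cuda.device(x.device):
        return awq_ext.gemm_forward(x, qweight, scales, zeros)
```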