Alexander Matveev comments


[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models.

@davidgxue We have initial correctness on 8-bit Marlin. We will do some perf checks and more testing, and will put up a PR in a couple of days.

[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models.

Btw, the new 8-bit Marlin will support all group_sizes and act_order.
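
For readers unfamiliar with the two options mentioned above, here is a minimal sketch of how group_size and act_order are commonly understood in GPTQ-style quantization. It is an illustration under stated assumptions, not the Marlin kernel or vLLM code: the function name `group_index` and the `importance` argument are hypothetical, and a group_size of -1 is assumed to mean one group spanning all input channels.

```python
from typing import List, Optional, Sequence


def group_index(num_channels: int,
                group_size: int,
                importance: Optional[Sequence[float]] = None) -> List[int]:
    """Map each input channel to the quantization group whose scale it uses.

    Hypothetical helper for illustration only.
    - group_size: number of input channels sharing one scale
      (-1 is assumed to mean a single group over all channels).
    - importance: per-channel activation statistic; when given, it emulates
      act_order by assigning groups in order of decreasing importance.
    """
    if group_size == -1:
        group_size = num_channels

    if importance is None:
        # No act_order: channels are grouped in their natural order.
        order = list(range(num_channels))
    else:
        # act_order: most important channels are processed (and grouped) first.
        order = sorted(range(num_channels), key=lambda c: -importance[c])

    g_idx = [0] * num_channels
    for position, channel in enumerate(order):
        g_idx[channel] = position // group_size
    return g_idx
```

Without act_order the group index increases monotonically across channels; with act_order it is scattered, which is one reason a kernel has to handle the two cases differently.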

[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models.

@davidgxue 8-bit support is added here: https://github.com/vllm-project/vllm/pull/4533

Add GPTQ Marlin 2:4 sparse structured support

Benchmark results on A100 for the Yi-34B Chat model with marlin_24 serialized weights (the actual weight values are not real yet). This is just to show preliminary results to...

Add GPTQ Marlin 2:4 sparse structured support

@pcmoritz This is a good idea. Changed the API to return str or None and moved the GPTQ-specific override logic to the override funcs.
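
The shape of the change described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from the PR: the class names, the method name `override_method`, the config keys, and the string "gptq_marlin_like" are all assumptions made for the example. The idea shown is that each quantization config gets an override function returning either a method name (str) or None, so format-specific checks live in the config that owns them rather than in the shared resolution code.

```python
from typing import Any, Dict, List, Optional, Type


class QuantConfigBase:
    """Hypothetical base class; by default it never overrides the method."""

    @classmethod
    def override_method(cls, hf_quant_cfg: Dict[str, Any],
                        user_quant: Optional[str]) -> Optional[str]:
        return None


class GPTQMarlinLikeConfig(QuantConfigBase):
    """Hypothetical config that claims compatible GPTQ checkpoints."""

    @classmethod
    def override_method(cls, hf_quant_cfg: Dict[str, Any],
                        user_quant: Optional[str]) -> Optional[str]:
        # GPTQ-specific checks live here, not in the generic resolver.
        is_gptq = hf_quant_cfg.get("quant_method") == "gptq"
        compatible = hf_quant_cfg.get("bits") in (4, 8)
        wanted = user_quant in (None, "gptq_marlin_like")
        if is_gptq and compatible and wanted:
            return "gptq_marlin_like"
        return None


def resolve_method(hf_quant_cfg: Dict[str, Any],
                   user_quant: Optional[str],
                   configs: List[Type[QuantConfigBase]]) -> Optional[str]:
    """Ask each config for an override; fall back to the requested method."""
    for cfg in configs:
        name = cfg.override_method(hf_quant_cfg, user_quant)
        if name is not None:
            return name
    return user_quant or hf_quant_cfg.get("quant_method")
```

Returning None for "no override" keeps the generic resolver free of per-format branches; it only needs to check whether any config claimed the checkpoint.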

Add GPTQ Marlin 2:4 sparse structured support

Cool, fixed the nit and some other little things.

Add GPTQ Marlin 2:4 sparse structured support

Thanks for the suggestions!

[Kernel] add bfloat16 support for gptq marlin kernel

@bnellnm could you do a quick pass on the template changes?

[Kernel] add bfloat16 support for gptq marlin kernel

@jinzhen-lin I think your code is in a good state to land once the last comments are addressed.

[Kernel] add bfloat16 support for gptq marlin kernel

@jinzhen-lin thanks for adding the tests and addressing all the comments. @robertgshaw2-neuralmagic this looks good to me to proceed.