
support int4 weight only quant for llama3

Open FlamingoPg opened this issue 1 year ago • 4 comments

Support int4 weight-only quantization for Llama3

  1. Define the weight-only quantized layer in ModelParallel.py
  2. Define ConvertWeightToOpmx.py and add the quantization step there
  3. Update the dynamic and static batching modeling for the quantized model
  4. Fix the Woqu function for the packed model
  5. Add a weight-only quant README.md
  6. Test export and demo under static and dynamic batching on an A100 GPU

Some WIP:

  1. Support AutoAWQ in ConvertWeightToOpmx.py

FlamingoPg avatar Jul 01 '24 17:07 FlamingoPg

@Jzz24 hi, I finished this PR. Can you review it?

FlamingoPg avatar Jul 05 '24 06:07 FlamingoPg

OK

Jzz24 avatar Jul 09 '24 08:07 Jzz24

Implemented Llama3 model export with naive quantization; AWQ will be added in a follow-up. After discussion, we decided not to do SplitModel and MergeModel for now.

FlamingoPg avatar Jul 15 '24 17:07 FlamingoPg

The previous PR wasn't merged, so I kept building on top of it. There is a lot of code, which may make the review difficult. QAQ

The overall logic is this: I defined the quantization steps in ConvertWeightToOpmx.py, which performs the actual quantization. Then I adapted the modeling code to the quantized weights. The quantized-layer definitions live in ModelParallel.py.
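On the modeling side, adapting to "weight only" quantization means the layer stores packed int4 weights plus per-group scales and dequantizes on the fly, while activations stay in floating point. A hypothetical sketch of such a layer (class and attribute names are illustrative, not the actual ModelParallel.py definitions):

```python
import numpy as np

class WeightOnlyQuantLinear:
    """Sketch of a weight-only-quantized linear layer: int4 weights are kept
    packed (two values per uint8) and dequantized per group at forward time;
    the matmul itself runs in fp32."""

    def __init__(self, packed_weight, scale, group_size=128):
        self.packed = packed_weight   # uint8, shape (out_features, in_features // 2)
        self.scale = scale            # fp32, shape (out_features, n_groups)
        self.group_size = group_size

    def forward(self, x):
        # Unpack the two int4 nibbles and shift back to the signed range [-8, 7].
        hi = (self.packed >> 4).astype(np.int8) - 8
        lo = (self.packed & 0x0F).astype(np.int8) - 8
        q = np.stack([hi, lo], axis=-1).reshape(self.packed.shape[0], -1)
        # Apply per-group scales to rebuild an fp32 weight, then do a normal matmul.
        wg = q.reshape(q.shape[0], -1, self.group_size).astype(np.float32)
        w = (wg * self.scale[..., None]).reshape(q.shape[0], -1)
        return x @ w.T
```

In a real kernel the dequantization is fused into the GEMM rather than materializing `w`, but the numerics are the same.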

FlamingoPg avatar Jul 16 '24 16:07 FlamingoPg