
[quant] Add int8 per token dynamic quant + int4 per group quant for ExecuTorch


Stack from ghstack (oldest at bottom):

  • -> #102

Summary: as titled.

Adding this for accuracy evaluation. We also added this in the ExecuTorch repo and will dedup later.
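For context, "8da4w" means int8 dynamic (per-token) activation quantization combined with int4 per-group weight quantization. A minimal pure-Python sketch of the arithmetic, assuming symmetric quantization — function names here are hypothetical and for illustration only; the real kernels live in quantize.py and ExecuTorch:

```python
# Illustrative sketch of the 8da4w arithmetic (symmetric quantization assumed).
# Not the actual implementation from this PR.

def quant_per_token_int8(row):
    # One scale per token (row); symmetric int8 range [-127, 127].
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # avoid div-by-zero for all-zero rows
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def quant_per_group_int4(weights, group_size=32):
    # One scale per group of `group_size` weights; symmetric int4 range [-7, 7].
    qs, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(v) for v in group) / 7.0 or 1.0
        qs.extend(max(-7, min(7, round(v / scale))) for v in group)
        scales.append(scale)
    return qs, scales

# Quantize one activation row, then dequantize to see the round-trip error.
q, s = quant_per_token_int8([0.5, -1.27, 0.0, 0.89])
deq = [v * s for v in q]
```

Per-token scales let each activation row use the full int8 range; per-group scales bound the int4 error to each group of 32 (the `.g32` suffix in the checkpoint name below) instead of the whole channel.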

Test Plan:

quantization:

python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode 8da4w-gptq --calibration_tasks wikitext --calibration_limit 5

This finished in 20+ min on my machine. If you change --calibration_limit to 1, it can finish in 10+ min, but expect worse quality since we do less calibration (use this when debugging a new quantization experiment).
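The faster debug variant of the command above, spelled out (this assumes MODEL_REPO is set in your environment, as in the original command):

```shell
# Quick-iteration run: only change vs. the command above is --calibration_limit 1,
# trading calibration quality for roughly half the runtime.
python quantize.py \
  --checkpoint_path checkpoints/$MODEL_REPO/model.pth \
  --mode 8da4w-gptq \
  --calibration_tasks wikitext \
  --calibration_limit 1
```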

evaluation:

python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model_8da4w-gptq.g32.pth --tasks wikitext

This should be fast; the result I'm getting is:

wikitext: {'word_perplexity,none': 10.15655335078972, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.5726497149737177, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6531973670369153, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
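As a sanity check on the reported numbers: byte perplexity and bits per byte are related by byte_perplexity = 2 ** bits_per_byte (this relation comes from the standard metric definitions, not from this PR), so the two values above should agree:

```python
# Consistency check between the two reported wikitext metrics,
# assuming the standard relation byte_perplexity = 2 ** bits_per_byte.
bits_per_byte = 0.6531973670369153
byte_perplexity = 2 ** bits_per_byte  # should be ~1.57265, matching the report
```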


jerryzh168 · Feb 08 '24 17:02