
[quant] Add int8 per token dynamic quant + int4 per group quant for ExecuTorch


Stack from ghstack (oldest at bottom):

  • -> #102

Summary: as titled.

Adding this for accuracy evaluation. We also added this in the ExecuTorch repo and will dedup later.
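For context, "8da4w" means int8 dynamic (per-token) activation quantization combined with int4 per-group weight quantization. A minimal pure-Python sketch of the arithmetic, assuming symmetric quantization — function names here are hypothetical and for illustration only; the real kernels live in quantize.py and ExecuTorch:

```python
# Illustrative sketch of the 8da4w arithmetic (symmetric quantization assumed).
# Not the actual implementation from this PR.

def quant_per_token_int8(row):
    # One scale per token (row); symmetric int8 range [-127, 127].
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # avoid div-by-zero for all-zero rows
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def quant_per_group_int4(weights, group_size=32):
    # One scale per group of `group_size` weights; symmetric int4 range [-7, 7].
    qs, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(v) for v in group) / 7.0 or 1.0
        qs.extend(max(-7, min(7, round(v / scale))) for v in group)
        scales.append(scale)
    return qs, scales

# Quantize one activation row, then dequantize to see the round-trip error.
q, s = quant_per_token_int8([0.5, -1.27, 0.0, 0.89])
deq = [v * s for v in q]
```

Per-token scales let each activation row use the full int8 range; per-group scales bound the int4 error to each group of 32 (the `.g32` suffix in the checkpoint name below) instead of the whole channel.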

Test Plan:

quantization:

python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode 8da4w-gptq --calibration_tasks wikitext --calibration_limit 5

This finished in 20+ min on my machine. If you change --calibration_limit to 1, it can finish in 10+ min, but expect worse quality since we do less calibration (use this when debugging a new quantization experiment).
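The faster debug variant of the command above, spelled out (this assumes MODEL_REPO is set in your environment, as in the original command):

```shell
# Quick-iteration run: only change vs. the command above is --calibration_limit 1,
# trading calibration quality for roughly half the runtime.
python quantize.py \
  --checkpoint_path checkpoints/$MODEL_REPO/model.pth \
  --mode 8da4w-gptq \
  --calibration_tasks wikitext \
  --calibration_limit 1
```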

evaluation:

python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model_8da4w-gptq.g32.pth --tasks wikitext

This should be fast; the result I'm getting is:

wikitext: {'word_perplexity,none': 10.15655335078972, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.5726497149737177, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6531973670369153, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
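As a sanity check on the reported numbers: byte perplexity and bits per byte are related by byte_perplexity = 2 ** bits_per_byte (this relation comes from the standard metric definitions, not from this PR), so the two values above should agree:

```python
# Consistency check between the two reported wikitext metrics,
# assuming the standard relation byte_perplexity = 2 ** bits_per_byte.
bits_per_byte = 0.6531973670369153
byte_perplexity = 2 ** bits_per_byte  # should be ~1.57265, matching the report
```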


jerryzh168 · Feb 08 '24 17:02