
[example] changed int8 quantization to do fp8 weight-only quantization

Chillee opened this pull request 1 year ago · 3 comments

In this case I'm guessing that for fp8 you might not need a separate scale parameter for the weights, since each fp8 value effectively carries its own scaling factor in its exponent bits.

I haven't done any evals, but this is just an example of weight-only fp8 support if folks want to play with it :P
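To make the "each value carries its own scale" point concrete, here is a minimal pure-Python sketch (not the PR's actual code, which uses PyTorch's `torch.float8_e4m3fn` dtype) of per-value e4m3 rounding. The constants assume the "fn" variant: 4 exponent bits with bias 7, 3 mantissa bits, max normal value 448, smallest subnormal 2**-9.

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest fp8 e4m3(fn)-representable value.

    Sketch only: assumes saturating behavior at the max normal (448),
    as torch.float8_e4m3fn does, rather than overflowing to inf.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    if a >= 448.0:                             # saturate at e4m3fn max
        return sign * 448.0
    e = max(math.floor(math.log2(a)), -6)      # clamp into subnormal range
    step = 2.0 ** (e - 3)                      # grid spacing at exponent e
    return sign * round(a / step) * step

def quantize_weights(w):
    """Weight-only quantization: round each weight independently.

    No per-tensor or per-channel scale is stored; each fp8 value keeps
    its own exponent, which is what makes the scale parameter optional.
    """
    return [[quantize_e4m3(v) for v in row] for row in w]
```

For example, `quantize_e4m3(0.3)` rounds to 0.3125 (the nearest multiple of 2**-5 in the [0.25, 0.5) binade), while `quantize_e4m3(500.0)` saturates to 448.0.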

Perf is at 102.9 tok/s for fp8 vs. 103.8 tok/s for int8 quantization.

Chillee · Dec 06 '23 03:12

Could we keep both int8 and fp8? Why replace one with the other, especially given the (subtle, but still) perf regression?

Artyom17 · Mar 01 '24 00:03

It's just an example PR — I'm not intending to merge it.

Chillee · Mar 01 '24 01:03