quanto
Potential README issue - falls back to original dtype, not fp32
In the docs, it says that when quantizing to anything other than int8, many operations will fall back to fp32.
However, looking through the code (and inserting some print lines), it seems like it actually casts tensors back to the dtype they were originally quantized from, which is not always fp32. With many modern NLP models it is bf16.
Is this the case?
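For context, here is a minimal sketch of the kind of print/hook instrumentation one might use to check this; the plain `nn.Linear` stands in for a quantized module and none of this is quanto's API.

```python
import torch

# Forward hook that prints the dtypes flowing through a module.
def report_dtypes(module, inputs, output):
    in_dtypes = [t.dtype for t in inputs if isinstance(t, torch.Tensor)]
    print(f"{module.__class__.__name__}: inputs {in_dtypes}, output {output.dtype}")

# Stand-in for a layer quantized from bf16 weights (hypothetical, not quanto).
layer = torch.nn.Linear(16, 16, dtype=torch.bfloat16)
layer.register_forward_hook(report_dtypes)

with torch.no_grad():
    layer(torch.randn(2, 16, dtype=torch.bfloat16))
# If the fallback used the original dtype rather than fp32, a hook like this
# on a model quantized from bf16 would report bf16 activations.
```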
I agree this is outdated. What actually happens for matrix multiplications is that the tensors are dequantized back to their original dtype, except if both tensors are int8. For int8 there are two options:
- if the device is CUDA, then torch._int_mm is used,
- otherwise the tensors are cast to float32 and torch.mm is used.
I am considering removing that last option as I think it slows things down.
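To make the dispatch concrete, here is a rough sketch of that logic in plain PyTorch. The function name and the way the original dtype is passed around are illustrative only; quanto tracks this on its quantized tensor objects rather than as an explicit argument.

```python
import torch

def qmm(a: torch.Tensor, b: torch.Tensor, orig_dtype: torch.dtype) -> torch.Tensor:
    """Illustrative matmul dispatch; orig_dtype stands in for the dtype
    the tensors were originally quantized from."""
    if a.dtype == torch.int8 and b.dtype == torch.int8:
        if a.is_cuda:
            # int8 x int8 on CUDA: accelerated integer matmul,
            # returns an int32 accumulator (rescaling omitted here).
            return torch._int_mm(a, b)
        # int8 x int8 elsewhere: cast to float32 and use a regular matmul
        # (the fallback being considered for removal above).
        return torch.mm(a.to(torch.float32), b.to(torch.float32))
    # Any other bit width: dequantize back to the original dtype
    # (bf16, fp16, ...), not necessarily fp32, then do a regular matmul.
    return torch.mm(a.to(orig_dtype), b.to(orig_dtype))

# Example: int8 inputs on CPU fall back to float32, not the original bf16.
a = torch.randint(-128, 127, (4, 8), dtype=torch.int8)
b = torch.randint(-128, 127, (8, 4), dtype=torch.int8)
print(qmm(a, b, orig_dtype=torch.bfloat16).dtype)  # torch.float32
```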