quanto
Potential README issue - falls back to original dtype, not fp32
In the docs, it says that when quantizing to anything other than int8, many operations will fall back to fp32.
However, looking through the code (and inserting some print lines), it seems like it actually casts tensors back to the dtype they were originally quantized from, which is not always fp32. With many modern NLP models it is bf16.
Is this the case?
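For context, here is a minimal sketch of the kind of print/hook instrumentation one might use to check this; the plain `nn.Linear` stands in for a quantized module and none of this is quanto's API.

```python
import torch

# Forward hook that prints the dtypes flowing through a module.
def report_dtypes(module, inputs, output):
    in_dtypes = [t.dtype for t in inputs if isinstance(t, torch.Tensor)]
    print(f"{module.__class__.__name__}: inputs {in_dtypes}, output {output.dtype}")

# Stand-in for a layer quantized from bf16 weights (hypothetical, not quanto).
layer = torch.nn.Linear(16, 16, dtype=torch.bfloat16)
layer.register_forward_hook(report_dtypes)

with torch.no_grad():
    layer(torch.randn(2, 16, dtype=torch.bfloat16))
# If the fallback used the original dtype rather than fp32, a hook like this
# on a model quantized from bf16 would report bf16 activations.
```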
I agree this is outdated. What actually happens for matrix multiplications is that the tensors are dequantized back to their original dtype, except if both tensors are int8. For int8 there are two options:
- if the device is CUDA, then torch._int_mm is used,
- otherwise the tensors are cast to float32 and torch.mm is used.
I am considering removing that last option as I think it slows things down.
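To make the dispatch concrete, here is a rough sketch of that logic in plain PyTorch. The function name and the way the original dtype is passed around are illustrative only; quanto tracks this on its quantized tensor objects rather than as an explicit argument.

```python
import torch

def qmm(a: torch.Tensor, b: torch.Tensor, orig_dtype: torch.dtype) -> torch.Tensor:
    """Illustrative matmul dispatch; orig_dtype stands in for the dtype
    the tensors were originally quantized from."""
    if a.dtype == torch.int8 and b.dtype == torch.int8:
        if a.is_cuda:
            # int8 x int8 on CUDA: accelerated integer matmul,
            # returns an int32 accumulator (rescaling omitted here).
            return torch._int_mm(a, b)
        # int8 x int8 elsewhere: cast to float32 and use a regular matmul
        # (the fallback being considered for removal above).
        return torch.mm(a.to(torch.float32), b.to(torch.float32))
    # Any other bit width: dequantize back to the original dtype
    # (bf16, fp16, ...), not necessarily fp32, then do a regular matmul.
    return torch.mm(a.to(orig_dtype), b.to(orig_dtype))

# Example: int8 inputs on CPU fall back to float32, not the original bf16.
a = torch.randint(-128, 127, (4, 8), dtype=torch.int8)
b = torch.randint(-128, 127, (8, 4), dtype=torch.int8)
print(qmm(a, b, orig_dtype=torch.bfloat16).dtype)  # torch.float32
```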