[Feature]: WARNING: Model is quantized. Forcing float16 datatype
🚀 The feature, motivation and pitch
float32 support when the model is quantized. I very much want this because the P40 only does FP32 and INT8.
Alternatives
No response
Additional context
No response
Good call, I didn't think of that. I'll make it so that the dtype flag takes priority over it for the next release.
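For reference, a minimal sketch of what "the dtype flag takes priority" could look like from the user side, assuming a vLLM-style Python API where `dtype` is accepted at load time. The import path, model name, and exact behavior here are assumptions for illustration, not a confirmed invocation for this project:

```python
# Minimal sketch, assuming a vLLM-style LLM(model=..., quantization=..., dtype=...)
# constructor. Import path, model name, and parameter names are illustrative only.
from vllm import LLM, SamplingParams

# Explicitly request FP32 compute; with the dtype flag taking priority over the
# quantization check, this would no longer be silently downcast to float16.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
    quantization="awq",
    dtype="float32",
)

outputs = llm.generate(["Hello from a P40"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```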
How come, when we are running GGUF models, it still gives that warning that it's forcing Float16, even though compute seems to happen in FP32 anyway? GGUF is currently the only way to get good performance out of Pascal cards, which only have decent FP32 performance.
Would being able to set dtype to Float32 for, say, an AWQ model let us do FP32 compute on Pascal cards?
Because there is no FP32 kernel? That's my guess.
No, what I'm saying is that compute already happens in FP32 for GGUF, because the GGUF kernel just uses FP32.
We no longer force this as of v0.6.0.