[Feature]: WARNING: Model is quantized. Forcing float16 datatype
🚀 The feature, motivation and pitch
float32 support when the model is quantized. I very much want this because the P40 only does FP32 and INT8.
Alternatives
No response
Additional context
No response
Good call, I didn't think of that. I'll make it so that the dtype flag takes priority over it for the next release.
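For reference, a minimal sketch of what "the dtype flag takes priority" could look like from the user side, assuming a vLLM-style Python API where `dtype` is accepted at load time. The import path, model name, and exact behavior here are assumptions for illustration, not a confirmed invocation for this project:

```python
# Minimal sketch, assuming a vLLM-style LLM(model=..., quantization=..., dtype=...)
# constructor. Import path, model name, and parameter names are illustrative only.
from vllm import LLM, SamplingParams

# Explicitly request FP32 compute; with the dtype flag taking priority over the
# quantization check, this would no longer be silently downcast to float16.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
    quantization="awq",
    dtype="float32",
)

outputs = llm.generate(["Hello from a P40"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```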
How come, when we are running GGUF models, it still gives that warning that it's forcing Float16, even though compute seems to happen in FP32 anyway? GGUF is currently the only way to get good performance out of Pascal cards, which only have decent FP32 performance.
Would being able to set dtype to Float32 for, say, an AWQ model let us do FP32 compute on Pascal cards?
Because there is no FP32 kernel? That's my guess.
No, what I'm saying is that compute already happens in FP32 for GGUF, because the GGUF kernel just uses FP32.
We no longer force this as of v0.6.0.