[Feature]: WARNING: Model is quantized. Forcing float16 datatype

Open sorasoras opened this issue 1 year ago • 4 comments

🚀 The feature, motivation and pitch

Please add float32 support when the model is quantized. I very much want this because the P40 only does FP32 and INT8.

Alternatives

No response

Additional context

No response

sorasoras avatar May 28 '24 15:05 sorasoras

Good call, I didn't think of that. I'll make it so that the dtype flag takes priority over it for the next release.

AlpinDale avatar May 28 '24 15:05 AlpinDale

How come, when we run GGUF models, it still gives the warning that it's forcing float16, but compute seems to happen in FP32 anyway? Right now GGUF is the only way to get good performance out of Pascal cards, which only have decent FP32 performance.

Would being able to set dtype to float32 for, say, an AWQ model let us do FP32 compute on Pascal cards?

Nero10578 avatar Jun 01 '24 07:06 Nero10578

Because there is no kernel for FP32? That's my guess.

sorasoras avatar Jun 01 '24 17:06 sorasoras

No, what I'm saying is that it is doing compute in FP32 for GGUF because the GGUF kernel just uses FP32.

Nero10578 avatar Jun 01 '24 19:06 Nero10578

We no longer force this as of v0.6.0.

AlpinDale avatar Sep 03 '24 13:09 AlpinDale
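
For readers skimming the thread, here is a minimal sketch of the behaviour that was requested: an explicit dtype choice taking priority over the float16 default for quantized models. This is not aphrodite-engine's actual code. The function name `resolve_dtype`, the `"auto"` sentinel, and the `model_dtype` fallback parameter are all assumptions made up for illustration.

```python
import warnings
from typing import Optional


def resolve_dtype(
    requested: str = "auto",
    quantization: Optional[str] = None,
    model_dtype: str = "float16",
) -> str:
    """Pick a compute dtype (hypothetical helper, not the engine's API).

    requested    -- what the user passed for dtype, or "auto" if nothing
    quantization -- e.g. "awq" or "gguf", or None for an unquantized model
    model_dtype  -- assumed fallback: the dtype recorded in the model config
    """
    if requested != "auto":
        # An explicit user choice wins, even when the model is quantized.
        return requested
    if quantization is not None:
        # Only fall back to float16 when the user expressed no preference.
        warnings.warn("Model is quantized. Defaulting to float16 datatype.")
        return "float16"
    return model_dtype
```

With this ordering, `resolve_dtype("float32", "awq")` returns `"float32"`, while `resolve_dtype("auto", "awq")` still falls back to `"float16"` with a warning. Whether the requested dtype then yields real FP32 compute depends on the quantization kernel supporting it, which is the open question raised above.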