QPyTorch
float_quantize with (8, 23) flips the sign of input values
Hi,
I have tried the following code:

```python
a = torch.tensor([3.0])
out = float_quantize(a, 8, 23, "nearest")
```
The output is printed as `-3.0`.
This happens only when the rounding mode is "nearest". I am not able to understand why this happens. Can you please explain it to me, in case I am missing something here?
what is printed out when you don't use nearest rounding?
When I use stochastic rounding, the input number is printed unchanged.
hi @ASHWIN2605
Good catch, I think this is an edge case. I'll look into the code soon.
But an 8-bit exponent with a 23-bit mantissa is the standard fp32 format anyway, so I don't think you need to quantize to it in the first place.
Hello,
This is from the `round_bitwise` function in quant_cpu.cpp.
Specifically, `rand_prob = 1 << (23 - man_bits - 1);` — when `man_bits = 23` this becomes `rand_prob = 1 << -1;`, and shifting by a negative count is undefined behavior in C++, so the resulting mask can be garbage (including ending up with the sign bit set).