FP8-Emulation-Toolkit
Why does the quantized value still exceed the range of FP8 representation?
Hi, thanks for providing such a complete toolkit. I have a question about its behavior.
I am using the toolkit to evaluate resnet18 on cifar10 with FP8 in hybrid mode, and I find that after the following operation the outputs still fall outside the set of values FP8 can represent exactly:
outputs = fpemu_cuda.forward(input.contiguous(), mode, size, inplace, scale, blocknorm, blocksize)
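For completeness, here is roughly how I call it on a standalone tensor (the argument values and the import below are illustrative guesses on my side, not necessarily what the toolkit expects):

```python
import torch
import fpemu_cuda  # the toolkit's compiled CUDA extension; exact import path is my assumption

# The same 3x3 slice shown below, as a standalone float16 tensor.
x = torch.tensor([[-0.0089, 0.0410, 0.0068],
                  [ 0.0663, 0.0292, 0.0986],
                  [ 0.0737, 0.0730, 0.0111]],
                 dtype=torch.float16, device='cuda:1')

# Illustrative argument values (my guesses): size = element count, no in-place update,
# unit scale, block normalization disabled, blocksize of 1.
outputs = fpemu_cuda.forward(x.contiguous(), 'E4M3_RNE', x.numel(),
                             False, 1.0, False, 1)
print(outputs)
```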
For example, input.data[0,0] before this operation is
tensor([[-0.0089, 0.0410, 0.0068],
[ 0.0663, 0.0292, 0.0986],
[ 0.0737, 0.0730, 0.0111]], device='cuda:1', dtype=torch.float16)
After this operation, output[0,0] is
tensor([[-0.0090, 0.0405, 0.0068],
[ 0.0676, 0.0293, 0.0991],
[ 0.0721, 0.0721, 0.0113]], device='cuda:1', dtype=torch.float16)
mode is 'E4M3_RNE'. The problem is that the first output value, -0.0090, has the float16 bit pattern 1 01000 0010011100 (sign / 5-bit exponent / 10-bit mantissa), which clearly cannot be represented exactly in FP8. Why can -0.0089 be quantized to -0.0090 under 'E4M3_RNE' mode?
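To double-check the bit pattern I quote above, I decode the float16 fields with a small script of my own (not part of the toolkit); it just reinterprets the 16-bit pattern as sign / 5-bit exponent / 10-bit mantissa:

```python
import torch

def fp16_fields(v):
    # Reinterpret a float16 scalar as its raw 16-bit pattern:
    # 1 sign bit, 5 exponent bits, 10 mantissa bits.
    bits = torch.tensor(v, dtype=torch.float16).view(torch.int16).item() & 0xFFFF
    sign     = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mantissa =  bits        & 0x3FF
    return sign, exponent, mantissa

for v in (-0.0089, -0.0090):
    s, e, m = fp16_fields(v)
    # For comparison, E4M3 has only 4 exponent bits and 3 mantissa bits.
    print(f"{v:+.4f} -> sign={s} exp={e:05b} mantissa={m:010b}")
```

For -0.0090 this prints sign=1 exp=01000 mantissa=0010011100, matching the pattern above.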
Thanks for reading, and I hope to hear back from you soon.