
Why does the quantized value still exceed the range of FP8 representation?

Open adfad1 opened this issue 1 year ago • 0 comments

Hi, thanks for providing such a complete toolkit. I have a question about it.

I use this toolkit to evaluate ResNet-18 on CIFAR-10 with FP8 in hybrid mode. I find that after the following operation, the outputs still exceed the range of FP8 representation:

outputs = fpemu_cuda.forward(input.contiguous(), mode, size, inplace, scale, blocknorm, blocksize)

For example, input.data[0,0] before this operation is

tensor([[-0.0089,  0.0410,  0.0068],
        [ 0.0663,  0.0292,  0.0986],
        [ 0.0737,  0.0730,  0.0111]], device='cuda:1', dtype=torch.float16)

After this operation, output[0,0] is

tensor([[-0.0090,  0.0405,  0.0068],
        [ 0.0676,  0.0293,  0.0991],
        [ 0.0721,  0.0721,  0.0113]], device='cuda:1', dtype=torch.float16)

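The mode is 'E4M3_RNE'. The problem is that the first output number, -0.0090, is 1 01000 0010011100 in FP16 binary (sign, exponent, mantissa). Its mantissa uses more than the 3 bits that E4M3 provides, so this number clearly exceeds what FP8 can represent. Why is -0.0089 quantized to -0.0090 under 'E4M3_RNE' mode?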
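For reference, the bit pattern above can be reproduced with a small standard-library sketch. The helper names `fp16_fields` and `fits_3_mantissa_bits` are hypothetical and are not part of the toolkit. The check only looks at mantissa width: it does not model the E4M3 exponent range, subnormals, or any scale factor the toolkit may apply. Note also that the printed tensor values are rounded to four decimals, so the exact bits of the real tensor element may differ from those of the literal -0.009.

```python
import struct

def fp16_fields(x):
    """Split x, stored as IEEE 754 half precision, into bit strings.

    Returns (sign, exponent, mantissa) with widths 1, 5 and 10 bits.
    """
    # Pack as a big-endian fp16, then reinterpret the same 2 bytes as uint16.
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    s = f"{bits:016b}"
    return s[0], s[1:6], s[6:]

def fits_3_mantissa_bits(x):
    """True if the fp16 mantissa of x uses only its top 3 bits.

    Assumption: mantissa width only; exponent range, subnormals and
    scaling are deliberately ignored in this sketch.
    """
    _, _, mantissa = fp16_fields(x)
    return mantissa[3:] == "0000000"

print(fp16_fields(-0.009))            # ('1', '01000', '0010011100')
print(fits_3_mantissa_bits(-0.009))   # False
```

By the same check, a value such as 2**-7 * 1.125 (about 0.0088) passes, because its mantissa is 0010000000.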

Thanks for reading and hope to hear back from you soon.

adfad1, Mar 30 '24 16:03