
FP8 lossy downcast issue with "ref" implementation

Open umangyadav opened this issue 2 years ago • 3 comments

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/pull/2506/files: this PR had to disable the FP8 tests for the CPU backend.

The ref implementation performs a Float -> FP8 -> Float conversion, but the CPU backend runs the entire test in Float.

Therefore the results come out slightly different.

We need to figure out a way to enable those tests again.
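
For illustration, here is a rough Python sketch (not MIGraphX code) of why the round trip is lossy. It only models the 3-bit mantissa rounding of fp8e4m3 and ignores the e4m3fnuz exponent range, bias, and saturation:

```python
import math

def round_trip_fp8(x: float) -> float:
    # Rough model of float -> fp8e4m3 -> float: keep 1 implicit + 3 explicit
    # mantissa bits and round to nearest. The real e4m3fnuz format also has a
    # limited exponent range, a different bias, and saturation, all ignored here.
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)

a, b = 0.1, 0.3
print(a * b)                                   # full-precision product, ~0.03
print(round_trip_fp8(a) * round_trip_fp8(b))   # 0.1015625 * 0.3125 = 0.03173828125
```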

e.g.

ref:
module: "main"
@0 = @literal{2} -> float_type, {1}, {0}, target_id=0
@1 = @literal{3} -> float_type, {1}, {0}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@5 = transpose[permutation={0, 1, 3, 2}](a) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@6 = transpose[permutation={0, 1, 3, 2}](b) -> fp8e4m3fnuz_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@7 = multibroadcast[out_lens={3, 2, 2, 8},out_dyn_dims={}](@1) -> float_type, {3, 2, 2, 8}, {0, 0, 0, 0}, target_id=0
@8 = convert[target_type=2](@5) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@9 = mul(@7,@8) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@10 = convert[target_type=12](@9) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@11 = quant_dot(@10,@6) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@12 = multibroadcast[out_lens={3, 2, 2, 7},out_dyn_dims={}](@0) -> float_type, {3, 2, 2, 7}, {0, 0, 0, 0}, target_id=0
@13 = mul(c,@12) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@14 = add(@11,@13) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0

## ref: quant_dot internally converts fp8e4m3fnuz_type to float and does the matrix multiplication
# Float -> fp8 -> float
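
Roughly, the difference between the ref semantics and the all-float CPU trace shown next looks like the following sketch (hypothetical helper names; the quantized values come from the mantissa-rounding sketch above, while the real ref quant_dot of course operates on fp8e4m3fnuz tensors):

```python
def ref_quant_dot(a_fp8, b_fp8):
    # Sketch of the ref semantics: fp8 operands are widened to float and the
    # multiply-accumulate runs in float.
    return sum(float(x) * float(y) for x, y in zip(a_fp8, b_fp8))

def cpu_dot(a_float, b_float):
    # Sketch of the current CPU path: the fp8 casts are elided, so the
    # original float values flow straight into the dot product.
    return sum(x * y for x, y in zip(a_float, b_float))

a = [0.1, 0.2, 0.3]
b = [0.4, 0.5, 0.6]
a_fp8 = [0.1015625, 0.203125, 0.3125]   # a after the float -> fp8 -> float round trip sketched above
b_fp8 = [0.40625, 0.5, 0.625]           # b after the same round trip

print(ref_quant_dot(a_fp8, b_fp8))      # 0.338134765625
print(cpu_dot(a, b))                    # ~0.32
```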
cpu:
module: "main"
@0 = cpu::preallocate[shape=int8_type, {1008}, {1},id=main:scratch] -> int8_type, {1008}, {1}, target_id=0
@1 = cpu::literal -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@2 = cpu::literal -> float_type, {1}, {0}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@6 = convert[target_type=2](a) -> float_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@7 = transpose[permutation={0, 1, 3, 2}](@6) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@8 = convert[target_type=2](b) -> float_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
@9 = transpose[permutation={0, 1, 3, 2}](@8) -> float_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@10 = load[offset=336,end=720](@0) -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@11 = dnnl::binary[post_ops={},algo=binary_mul](@1,@7,@10) -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@12 = load[offset=0,end=336](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@13 = dnnl::dot[post_ops={}](@11,@9,@12) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@14 = multibroadcast[out_lens={3, 2, 2, 7},out_dyn_dims={}](@2) -> float_type, {3, 2, 2, 7}, {0, 0, 0, 0}, target_id=0
@15 = load[offset=672,end=1008](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@16 = dnnl::binary[post_ops={},algo=binary_mul](c,@14,@15) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@17 = load[offset=336,end=672](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@18 = dnnl::binary[post_ops={},algo=binary_add](@13,@16,@17) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0

## GPU:
# float -> fp8 -> (fp8 inputs -> float32 accumulation) -> float
# (a small numeric sketch of this difference follows the dump below)
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}, target_id=0
@1 = hip::hip_allocate_memory[shape=int8_type, {432}, {1},id=main:scratch] -> int8_type, {432}, {1}, target_id=0
@2 = load[offset=336,end=432](@1) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@4 = transpose[permutation={0, 1, 3, 2}](a) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@5 = gpu::code_object[code_object=9120,symbol_name=convert_mul_convert_kernel,global=96,local=1024,](@4,@2) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@6 = load[offset=0,end=336](@1) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
@8 = transpose[permutation={0, 1, 3, 2}](b) -> fp8e4m3fnuz_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@9 = gpu::quant_gemm[alpha=1,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@5,@8,@6) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
output = @param:output -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@12 = gpu::code_object[code_object=9288,symbol_name=mul_add_kernel,global=42,local=1024,](c,@9,output) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
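
Putting the three traces side by side, the divergence comes from the scaled operand that feeds the dot: ref (@10) and the GPU (the fused convert_mul_convert kernel at @5) both narrow it back to fp8, while the CPU trace keeps it in float. A tiny sketch of that single step, again using the hypothetical mantissa-rounding helper rather than MIGraphX code:

```python
import math

def round_trip_fp8(x: float) -> float:
    # Same rough 3-bit-mantissa model of float -> fp8e4m3 -> float as above.
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 16) / 16, e)

scale = 3.0                 # the @1 literal in the ref trace
x = 0.1015625               # an fp8-representable input element

ref_or_gpu_operand = round_trip_fp8(x * scale)  # narrowed back to fp8: 0.3125
cpu_operand = x * scale                         # stays in float:       0.3046875
```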

umangyadav avatar Dec 05 '23 21:12 umangyadav

The fix for this issue should work for all hardware, including MI300.

e.g. #2506 attempted to fix this by adding a simplification for nested converts, but it didn't work on MI300.
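
For context, a nested-convert simplification rewrites convert<T2>(convert<T1>(x)) into convert<T2>(x). Here is a minimal sketch of the safety condition such a rewrite has to respect (hypothetical, not the actual #2506 pass): the fold is only value-preserving when the inner type can hold every value of the source type, which float -> fp8 -> float does not satisfy.

```python
# Hypothetical sketch, not the MIGraphX pass: when can
#   convert<outer>(convert<inner>(x))   be folded to   convert<outer>(x) ?
# Only when the inner type represents every value of x's type exactly;
# otherwise the fold silently drops an intentional rounding step.
WIDTH = {"fp8e4m3fnuz": 0, "half": 1, "float": 2, "double": 3}

def can_fold_nested_convert(source_type: str, inner_type: str) -> bool:
    return WIDTH[inner_type] >= WIDTH[source_type]

print(can_fold_nested_convert("fp8e4m3fnuz", "float"))  # True:  fp8 -> float -> T keeps the value
print(can_fold_nested_convert("float", "fp8e4m3fnuz"))  # False: float -> fp8 -> T is lossy
```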

umangyadav avatar Dec 06 '23 23:12 umangyadav

@lakhinderwalia FYI

umangyadav avatar Apr 16 '24 17:04 umangyadav

Thanks, @umangyadav. Yes, the right thing is to disable such apples-to-oranges tests. In this case the issue (ref vs GPU for test_quantizelinear_convert) is very similar, and assuming it works fine while the GPU execution optimizes out the convert step is simply an incorrect approach to testing.

lakhinderwalia avatar Apr 16 '24 18:04 lakhinderwalia