FP8 lossy downcast issue with "ref" implementation
This PR had to disable FP8 tests for the CPU backend: https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/pull/2506/files
The ref implementation does a float -> fp8 -> float conversion, while the CPU backend runs the entire test in float, so the results come out slightly different.
We need to figure out a way to enable those tests again.
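For illustration, here is a minimal sketch (not MIGraphX code) of why the float -> fp8 -> float round trip is lossy: fp8e4m3 keeps only 3 mantissa bits, so most float values get rounded. The helper below is a crude emulation that ignores the FNUZ bias, saturation, and subnormals.

```cpp
#include <cmath>
#include <cstdio>

// Crude emulation of fp8e4m3 rounding for illustration only: keep the leading
// bit plus 3 mantissa bits, round to nearest. Ignores FNUZ bias, saturation,
// and subnormals; this is not the MIGraphX conversion code.
float round_to_fp8_precision(float x)
{
    if(x == 0.0f)
        return 0.0f;
    int exp;
    float frac = std::frexp(x, &exp); // x = frac * 2^exp, frac in [0.5, 1)
    return std::ldexp(std::nearbyint(std::ldexp(frac, 4)), exp - 4);
}

int main()
{
    float x  = 0.3f;
    float rt = round_to_fp8_precision(x); // float -> "fp8" -> float
    std::printf("original: %.6f  round-tripped: %.6f  abs error: %.6f\n",
                x, rt, std::fabs(x - rt)); // 0.300000 vs 0.312500
    // The ref backend computes with round-tripped values; the CPU backend
    // keeps the original floats, so the two outputs can differ by roughly
    // this much per element before the dot product even starts.
}
```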
For example, here is the same test program as printed for each backend:
ref:
module: "main"
@0 = @literal{2} -> float_type, {1}, {0}, target_id=0
@1 = @literal{3} -> float_type, {1}, {0}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@5 = transpose[permutation={0, 1, 3, 2}](a) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@6 = transpose[permutation={0, 1, 3, 2}](b) -> fp8e4m3fnuz_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@7 = multibroadcast[out_lens={3, 2, 2, 8},out_dyn_dims={}](@1) -> float_type, {3, 2, 2, 8}, {0, 0, 0, 0}, target_id=0
@8 = convert[target_type=2](@5) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@9 = mul(@7,@8) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@10 = convert[target_type=12](@9) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@11 = quant_dot(@10,@6) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@12 = multibroadcast[out_lens={3, 2, 2, 7},out_dyn_dims={}](@0) -> float_type, {3, 2, 2, 7}, {0, 0, 0, 0}, target_id=0
@13 = mul(c,@12) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@14 = add(@11,@13) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
## ref: quant_dot internally converts fp8e4m3fnuz_type to float and does the matrix multiplication in float
# float -> fp8 -> float (contrasted with the CPU path in the sketch after the CPU listing below)
cpu:
module: "main"
@0 = cpu::preallocate[shape=int8_type, {1008}, {1},id=main:scratch] -> int8_type, {1008}, {1}, target_id=0
@1 = cpu::literal -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@2 = cpu::literal -> float_type, {1}, {0}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@6 = convert[target_type=2](a) -> float_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@7 = transpose[permutation={0, 1, 3, 2}](@6) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@8 = convert[target_type=2](b) -> float_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
@9 = transpose[permutation={0, 1, 3, 2}](@8) -> float_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@10 = load[offset=336,end=720](@0) -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@11 = dnnl::binary[post_ops={},algo=binary_mul](@1,@7,@10) -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@12 = load[offset=0,end=336](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@13 = dnnl::dot[post_ops={}](@11,@9,@12) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@14 = multibroadcast[out_lens={3, 2, 2, 7},out_dyn_dims={}](@2) -> float_type, {3, 2, 2, 7}, {0, 0, 0, 0}, target_id=0
@15 = load[offset=672,end=1008](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@16 = dnnl::binary[post_ops={},algo=binary_mul](c,@14,@15) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@17 = load[offset=336,end=672](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@18 = dnnl::binary[post_ops={},algo=binary_add](@13,@16,@17) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
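To make the divergence concrete, here is a small self-contained sketch (illustration only, reusing the same crude fp8 rounding emulation as above). It contrasts the ref-style pipeline, where the scaled operand is converted back to fp8 (@10) before quant_dot, with the CPU-style pipeline, where the dnnl::binary result stays in float all the way into dnnl::dot. The values are toy stand-ins, not data from the actual test.

```cpp
#include <cmath>
#include <cstdio>

// Crude fp8e4m3 rounding emulation (3 mantissa bits), illustration only.
static float q(float x)
{
    if(x == 0.0f)
        return 0.0f;
    int e;
    float f = std::frexp(x, &e);
    return std::ldexp(std::nearbyint(std::ldexp(f, 4)), e - 4);
}

int main()
{
    // Toy stand-ins for one row/column of the operands; the scale 3 mirrors
    // the literal @1 in the ref listing.
    const float a[4] = {0.3f, 1.7f, -2.2f, 0.05f};
    const float b[4] = {0.9f, -0.6f, 1.1f, 3.3f};

    float ref_style = 0.0f; // ref: 3*a is converted back to fp8 before quant_dot
    float cpu_style = 0.0f; // cpu: 3*a stays in float before dnnl::dot
    for(int i = 0; i < 4; ++i)
    {
        float ai = q(a[i]); // a and b are fp8 params on both backends
        float bi = q(b[i]);
        ref_style += q(3.0f * ai) * bi;
        cpu_style += (3.0f * ai) * bi;
    }
    std::printf("ref-style: %.6f  cpu-style: %.6f  diff: %.6f\n",
                ref_style, cpu_style, ref_style - cpu_style);
}
```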
## GPU:
# float -> fp8 -> (fp8 inputs -> float32 accumulation) -> float (see the sketch after this listing)
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}, target_id=0
@1 = hip::hip_allocate_memory[shape=int8_type, {432}, {1},id=main:scratch] -> int8_type, {432}, {1}, target_id=0
@2 = load[offset=336,end=432](@1) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@4 = transpose[permutation={0, 1, 3, 2}](a) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@5 = gpu::code_object[code_object=9120,symbol_name=convert_mul_convert_kernel,global=96,local=1024,](@4,@2) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@6 = load[offset=0,end=336](@1) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
@8 = transpose[permutation={0, 1, 3, 2}](b) -> fp8e4m3fnuz_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@9 = gpu::quant_gemm[alpha=1,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@5,@8,@6) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
output = @param:output -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@12 = gpu::code_object[code_object=9288,symbol_name=mul_add_kernel,global=42,local=1024,](c,@9,output) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
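The compute_fp32=1 flag on gpu::quant_gemm suggests the kernel keeps its operands in fp8 and accumulates the products in float32, so, like the ref path, it sees fp8-rounded operands rather than the original floats. A minimal sketch of that inner loop under this assumption (the real kernel is not shown here; fp8 operands are again emulated by rounding):

```cpp
#include <cmath>
#include <cstdio>

// Crude fp8e4m3 rounding emulation (3 mantissa bits), illustration only.
static float q(float x)
{
    if(x == 0.0f)
        return 0.0f;
    int e;
    float f = std::frexp(x, &e);
    return std::ldexp(std::nearbyint(std::ldexp(f, 4)), e - 4);
}

int main()
{
    const float a[4] = {0.3f, 1.7f, -2.2f, 0.05f};
    const float b[4] = {0.9f, -0.6f, 1.1f, 3.3f};

    float acc = 0.0f;             // float32 accumulator (compute_fp32=1)
    for(int i = 0; i < 4; ++i)
        acc += q(a[i]) * q(b[i]); // operands stay at fp8 precision
    std::printf("fp8-operand / fp32-accumulate dot: %.6f\n", acc);
}
```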
The fix for this issue should work on all hardware, including MI300.
For example, #2506 attempted to fix this by adding a simplification for nested converts, but it didn't work on MI300 (see the conceptual sketch below).
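For reference, the general reason a blanket nested-convert simplification is risky: folding convert(convert(x)) is only an identity when the intermediate type can hold every value of the outer type exactly, so fp8 -> float -> fp8 can be folded away but float -> fp8 -> float cannot. A conceptual sketch of that check (hypothetical helper, not the code from #2506):

```cpp
#include <cstdio>

// Hypothetical illustration, not the matcher added in #2506: a nested convert
// T1 -> T2 -> T1 may be folded to the identity only if T2 can represent every
// T1 value exactly (wider or equal mantissa and exponent).
struct float_format
{
    const char* name;
    int mantissa_bits;
    int exponent_bits;
};

bool foldable_round_trip(const float_format& t1, const float_format& t2)
{
    return t2.mantissa_bits >= t1.mantissa_bits &&
           t2.exponent_bits >= t1.exponent_bits;
}

int main()
{
    float_format fp8 {"fp8e4m3fnuz", 3, 4};
    float_format fp32{"float", 23, 8};

    std::printf("fp8 -> float -> fp8 foldable:   %d\n", foldable_round_trip(fp8, fp32));  // 1
    std::printf("float -> fp8 -> float foldable: %d\n", foldable_round_trip(fp32, fp8)); // 0
}
```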
@lakhinderwalia FYI
Thanks, @umangyadav. Yes, the right thing is to disable such apples-to-oranges tests. In this case the issue (ref vs GPU for test_quantizelinear_convert) is very similar: assuming the test works fine while the GPU execution optimizes out the convert step is simply an incorrect way to test.