
AArch64: graph fusion and graph ci cases failed in AArch64 platform

Open xiang1guo opened this issue 4 months ago • 6 comments

This issue tracks graph-related test failures on the AArch64 platform. They are currently skipped in the CI testing script.

xiang1guo avatar Aug 26 '25 07:08 xiang1guo

I did some poking around and am having difficulty reproducing the graph_fusions failures from within the standalone drivers. I don't know whether this is because the graph tests fill NaN/Inf values differently from the regular tests.

Here is the log for one failing case:

$ ONEDNN_VERBOSE=profile_exec ./build/tests/benchdnn/benchdnn --graph --dt=f16 --case=complex_fusion/mlp/gated-mlp-f32.json
onednn_verbose,v1,info,oneDNN v3.11.0 (commit f1002304690d09bb9e96ceaf0bb1b018a5e91027)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,v1,info,cpu,isa:AArch64 SVE (128 bits)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ba::f0 dst:f32::blocked:ab::f0,,,4096x14336,3.06787
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096,0.283936
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ba::f0 dst:f32::blocked:ab::f0,,,4096x14336,1.78003
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096,0.371094
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.0090332
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.00805664
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.0090332
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.00805664
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.00805664
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ba::f0 dst:f32::blocked:ab::f0,,,14336x4096,1.93408
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.329834
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32::blocked:ab::f0 wei:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096:4096x14336,12.8149
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32::blocked:ab::f0 wei:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096:4096x14336,1.64502
onednn_verbose,v1,primitive,exec,cpu,eltwise,jit:sve_128,forward_training,data:f32::blocked:ab::f0,,alg:eltwise_logistic alpha:0 beta:0,1x14336,0.0109863
onednn_verbose,v1,primitive,exec,cpu,binary,jit:uni,undef,src:f32::blocked:ab::f0 src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,alg:binary_mul,1x14336:1x14336,0.0100098
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x14336,0.166016
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f16::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.350098
onednn_verbose,v1,primitive,exec,cpu,binary,jit:uni,undef,src:f32::blocked:ab::f0 src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,alg:binary_mul,1x14336:1x14336,0.437988
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x14336,0.407959
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f16::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336,0.324951
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32::blocked:ab::f0 wei:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336:14336x4096,4.17993
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x4096,0.166992
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,4096x14336,1.02686
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x4096,0.155029
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,4096x14336,0.983154
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,14336x4096,0.956787
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user,,1x4096:4096x14336,5.55713
onednn_verbose,v1,primitive,exec,cpu,matmul,ref:any,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user attr-post-ops:eltwise_swish:1+binary_mul:f16:2,,1x4096:4096x14336,21.385
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user,,1x14336:14336x4096,2.05908
onednn_verbose,v1,graph,exec,cpu,100002,matmul_post_ops,fc_gate;swish/sigmoid;swish/multiply;fc_up;mul;fc_down,,in0_f16:0:strided:undef:1x4096:4096s1 in1_f16:1:strided:undef:4096x14336:14336s1 in2_f16:0:strided:undef:1x4096:4096s1 in3_f16:4:strided:undef:4096x14336:14336s1 in4_f16:13:strided:undef:14336x4096:4096s1 out0_f16:14:strided:undef:1x4096:4096s1,fpm:strict,larger_partition_kernel_t,dnnl_backend,36.252
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096,0.164062
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f16::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096,0.0268555
[   2][0:2] exp_f32:    -90976.3 exp:        -inf got:         inf diff:     inf rdiff:     nan
[   6][0:6] exp_f32:      310121 exp:         inf got:        -inf diff:     inf rdiff:     nan
[   7][0:7] exp_f32:       52180 exp:       52192 got:         inf diff:     inf rdiff:     inf
[   9][0:9] exp_f32:     -583013 exp:        -inf got:         inf diff:     inf rdiff:     nan
[  10][0:10] exp_f32:      801264 exp:         inf got:        -inf diff:     inf rdiff:     nan
[  14][0:14] exp_f32:     -935639 exp:        -inf got:         inf diff:     inf rdiff:     nan
[  15][0:15] exp_f32:      264205 exp:         inf got:        -inf diff:     inf rdiff:     nan
[  16][0:16] exp_f32:     -241262 exp:        -inf got:         inf diff:     inf rdiff:     nan
[  17][0:17] exp_f32:     -697940 exp:        -inf got:         inf diff:     inf rdiff:     nan
[  19][0:19] exp_f32:-1.47787e+06 exp:        -inf got:         inf diff:     inf rdiff:     nan
[COMPARE_STATS]: trh=0 err_max_diff:     inf err_max_rdiff:     nan all_max_diff:     inf all_max_rdiff:     inf
[COMPARE_STATS] Norm check is prohibited; error_to_total_ratio: 2178/4096; allowed_ratio: 4/4096;
Error: Function 'doit' at (tests/benchdnn/graph/graph.cpp:754) returned '1'
0:FAILED (errors:2178 total:4096) (618 ms) __REPRO: --graph --dt=f16 --case=complex_fusion/mlp/gated-mlp-f32.json
===========================================================
= Failed cases summary (--summary=no-failures to disable) =
===========================================================
0:FAILED (errors:2178 total:4096) (618 ms) __REPRO: --graph --dt=f16 --case=complex_fusion/mlp/gated-mlp-f32.json
============================
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0
total: 0.62s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 0.00s (0%); execute: 0.02s (3%); compute_ref: 0.00s (0%); compare: 0.00s (0%);

I know that commenting out gemm:acl from the matmul impl list resolves this particular failure, so I grepped the verbose output for its invocations:

onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32::blocked:ab::f0 wei:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096:4096x14336,12.8579
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32::blocked:ab::f0 wei:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096:4096x14336,2.5
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32::blocked:ab::f0 wei:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x14336:14336x4096,4.05298
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user,,1x4096:4096x14336,6.3562
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user,,1x14336:14336x4096,1.21997
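The filtering step above can also be scripted. This is a minimal sketch in Python, assuming the field order given by the `template:` line in the verbose log (operation, engine, primitive, implementation, ...). The helper name `filter_impl` is hypothetical and not part of benchdnn or oneDNN.

```python
def filter_impl(lines, primitive, impl):
    """Return verbose exec lines matching one primitive kind and implementation."""
    hits = []
    for line in lines:
        fields = line.strip().split(",")
        # Assumed layout (taken from the template line in the log):
        # onednn_verbose,v1,primitive,exec,<engine>,<primitive>,<implementation>,...
        if len(fields) < 7:
            continue
        if fields[:4] != ["onednn_verbose", "v1", "primitive", "exec"]:
            continue
        if fields[5] == primitive and fields[6] == impl:
            hits.append(line.strip())
    return hits


# Three lines copied from the failing log above.
sample = [
    "onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x4096,0.166992",
    "onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user,,1x4096:4096x14336,5.55713",
    "onednn_verbose,v1,primitive,exec,cpu,matmul,ref:any,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user attr-post-ops:eltwise_swish:1+binary_mul:f16:2,,1x4096:4096x14336,21.385",
]

acl_matmuls = filter_impl(sample, "matmul", "gemm:acl")
for line in acl_matmuls:
    print(line)
```

Filtering on the implementation field rather than on a plain substring keeps the `ref:any` matmul with fused post-ops out of the result, which matters here because that is the only matmul in the partition that does not go through ACL.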

But each of these passes when turned into its own benchdnn --matmul case (--matmul --global-impl='gemm:acl' --batch=test_matmul_float16 passes as well).

$ ./build/tests/benchdnn/benchdnn --matmul --global-impl='gemm:acl' --dt=f32,f16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=,user 1x4096:4096x14336
0:PASSED (119 ms) __REPRO: --matmul --stag=ab --wtag=ab --dtag=ab --impl=gemm:acl 1x4096:4096x14336
1:PASSED (174 ms) __REPRO: --matmul --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user --impl=gemm:acl 1x4096:4096x14336
2:PASSED (148 ms) __REPRO: --matmul --dt=f16:f16:f16 --stag=ab --wtag=ab --dtag=ab --impl=gemm:acl 1x4096:4096x14336
3:PASSED (179 ms) __REPRO: --matmul --dt=f16:f16:f16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user --impl=gemm:acl 1x4096:4096x14336
tests:4 passed:4 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0

$ ./build/tests/benchdnn/benchdnn --matmul --global-impl='gemm:acl' --dt=f32,f16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=,user 1x14336:14336x4096
0:PASSED (122 ms) __REPRO: --matmul --stag=ab --wtag=ab --dtag=ab --impl=gemm:acl 1x14336:14336x4096
1:PASSED (179 ms) __REPRO: --matmul --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user --impl=gemm:acl 1x14336:14336x4096
2:PASSED (151 ms) __REPRO: --matmul --dt=f16:f16:f16 --stag=ab --wtag=ab --dtag=ab --impl=gemm:acl 1x14336:14336x4096
3:PASSED (179 ms) __REPRO: --matmul --dt=f16:f16:f16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user --impl=gemm:acl 1x14336:14336x4096
tests:4 passed:4 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0

Since I am unfamiliar with the graph path, I would greatly appreciate any clues about what the complex_fusion cases do differently from the --matmul driver or graph_tests_float16 to trigger the failure only here. Thank you!

@mgouicem, @dzarukin your wisdom would be much appreciated.

Sqvid avatar Oct 14 '25 15:10 Sqvid

I tried modifying this line in the benchdnn --matmul buffer fill function: https://github.com/uxlfoundation/oneDNN/blob/aed0c1a497b09f8b97b7932bc7afa8e2c13c3953/tests/benchdnn/matmul/matmul.cpp#L476

I changed this to:

float val = 65500.f; 

to try and trigger an overflow. But running with benchdnn --matmul -v99 still passes and shows matching overflow behaviour:

create: --matmul --dt=f16:f16:f16 --stag=ab --wtag=ab --dtag=ab --impl=gemm:acl 2x2:2x2
oneDNN implementation: gemm:acl
CPU reference oneDNN implementation: gemm:acl
run: --matmul --dt=f16:f16:f16 --stag=ab --wtag=ab --dtag=ab --impl=gemm:acl 2x2:2x2
[FILL_CFG] SRC_f16=[-4;4]; WEI_f16=[-2;2]; DST_f16=[-4;4];
onednn_verbose,v1,info,oneDNN v3.11.0 (commit 92ad568522de34729b44a9a07d9493f313cdcb5b)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:64
onednn_verbose,v1,info,cpu,isa:AArch64 SVE (128 bits)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,2x2,0.00195312
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ba::f0 dst:f16::blocked:ab::f0,,,2x2,0.000976562
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ba::f0 dst:f32:p:blocked:Ba4b::f0,,,2x2,0.000976562
[FILL_CFG] safe_n_acc=256 density=1.000000
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f16::blocked:ab::f0,,,2x2,0
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,2x2,0
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,2x2:2x2,1.56909
run ref: --matmul 2x2:2x2
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32:a:blocked:ab::f0 wei:f32:ap:blocked:Ba4b::f0 dst:f32:a:blocked:ab::f0,,,2x2:2x2,0.0288086
[COMPARE][DST]: zero_trust%=90.00% extra=has_prim_ref:true;
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f16::blocked:ab::f0 dst:f32::blocked:ab::f0,,,2x2,0.0012207
[   0][DST][0:0] exp_f32: 4.29077e+09 exp:         inf got:         inf diff:     nan rdiff:     nan
[   1][DST][0:1] exp_f32: 4.29084e+09 exp:         inf got:         inf diff:     nan rdiff:     nan
[   2][DST][1:0] exp_f32: 4.29091e+09 exp:         inf got:         inf diff:     nan rdiff:     nan
[   3][DST][1:1] exp_f32: 8.58155e+09 exp:         inf got:         inf diff:     nan rdiff:     nan
[COMPARE_STATS][DST]: trh=0 err_max_diff:       0 err_max_rdiff:       0 all_max_diff:       0 all_max_rdiff:       0
[COMPARE_TRUST][DST]: z: 0% (>90%) (z: 0, total: 4)
0:PASSED (4 ms) __REPRO: --matmul --dt=f16:f16:f16 --stag=ab --wtag=ab --dtag=ab --impl=gemm:acl 2x2:2x2

Sqvid avatar Oct 14 '25 15:10 Sqvid

Hi @Sqvid, here's the WithDoom... :) Graph propagates data between ops, which accumulates and may lead to the results observed. I compared to x64 and noticed that ACL flips the inf sign for some reason (points 2 and 6). And point 7 is incorrect, likely because it accumulates into f16 instead of f32, reaches inf and sticks to it, while f32 gets its result reduced through negative additions and finally remains a meaningful f16 value.

You'll need to play with the data to trigger such effects on the ACL side, but I'd suggest changing the cfg border values for f16 instead; it's more transparent. Just use longer accumulation chains.
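The point-7 effect described above can be reproduced outside oneDNN. This is a minimal sketch, assuming IEEE round-to-nearest conversion; the input values are made up so that the sum lands on 52192 as in the log, and `to_f16`, `to_f32` and `accumulate` are hypothetical helpers. It says nothing about what the ACL kernel actually does internally.

```python
import math
import struct


def to_f16(x):
    """Round to IEEE half precision; values past the f16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack finite values that overflow f16, so saturate
        # to infinity the way a hardware down-conversion would.
        return math.copysign(math.inf, x)


def to_f32(x):
    """Round to IEEE single precision (inputs here are well inside the f32 range)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(terms, cast):
    """Sum the terms, rounding the running total with `cast` after every add."""
    acc = 0.0
    for t in terms:
        acc = cast(acc + t)
    return acc


# Made-up partial products: the running sum exceeds the f16 maximum (65504)
# before a negative term brings it back into range.
terms = [40000.0, 40000.0, -27808.0]

f16_accumulated = accumulate(terms, to_f16)          # hits inf and stays there
f32_accumulated = to_f16(accumulate(terms, to_f32))  # downcast once at the end

print("f16 accumulation:", f16_accumulated)
print("f32 accumulation:", f32_accumulated)
```

Once the f16 running total overflows, no later negative term can bring it back, which is why longer accumulation chains make the divergence easier to hit.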

dzarukin avatar Oct 14 '25 20:10 dzarukin

@dzarukin Thank you for the insights. I think you're right that the long chains are causing this issue, though the underlying ACL matmul kernel is accumulating to f32 here.

If we look at the following pair of lines from the verbose output:

onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:acl,undef,src:f32::blocked:ab::f0 wei:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,1x4096:4096x14336,1.64502
onednn_verbose,v1,primitive,exec,cpu,eltwise,jit:sve_128,forward_training,data:f32::blocked:ab::f0,,alg:eltwise_logistic alpha:0 beta:0,1x14336,0.0109863

It looks like we are doing matmul_f16_with_f32_acc(src = 1x4096, wei = 4096x14336, dst = 1x14336) and then calling an eltwise_logistic(src = 1x14336) on it. So does x64 do the logistic function on the f32 accumulation buffer before downcasting and outputting in f16? That would explain the difference as ACL would downcast to f16 first (possibly producing infs) and then apply the logistic.

Sqvid avatar Oct 15 '25 09:10 Sqvid

> So does x64 do the logistic function on the f32 accumulation buffer before downcasting and outputting in f16?

It's a programming model promise that all post-ops are done on f32 unless acc_mode isn't strict. And I think we had this problem with applying post-ops on top of down-converted values from acc before...
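The difference between the two orderings can be illustrated with a small sketch. It assumes a swish post-op followed by a multiply, as in the fused partition from the log, with made-up values for the accumulator and the multiplier. Python floats (doubles) stand in for the f32 accumulator, and `to_f16` and `swish` are hypothetical helpers, not oneDNN or ACL code.

```python
import math
import struct


def to_f16(x):
    """Round to IEEE half precision; values past the f16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def swish(x):
    """x * sigmoid(x), with sigmoid written to avoid overflow in exp()."""
    if x >= 0:
        s = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        s = e / (1.0 + e)
    return x * s


gate_acc = 80000.0  # made-up matmul accumulator value, above the f16 maximum
up = 0.5            # made-up second operand of the binary multiply

# Post-ops applied to the wide accumulator, one downcast at the very end.
postops_on_acc = to_f16(swish(gate_acc) * up)

# Accumulator downcast to f16 first, post-ops applied afterwards.
postops_after_downcast = to_f16(swish(to_f16(gate_acc)) * up)

print("post-ops on accumulator:", postops_on_acc)
print("post-ops after downcast:", postops_after_downcast)
```

With the wide accumulator the multiply brings the value back into f16 range before the final conversion; with the early downcast the intermediate is already inf and the post-ops cannot recover it.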

dzarukin avatar Oct 15 '25 15:10 dzarukin

Thanks @dzarukin, I'll look into it. There are several other classes of bugs hiding behind the graph_fusions failure. I'll do some more triaging and then share some details here.

Sqvid avatar Oct 16 '25 09:10 Sqvid