AMDMIGraphX
Infinite recursion in module::sort(), speech_transformer
The model speech_transformer in the DLM PyTorch performance test raises an exception that traces to infinite recursion in the MIGraphX method `module::sort()`.
The error is seen only on branch `fix_find_pointwise_reduce`; on the `develop` branch the test exits earlier due to a different bug. The error does not occur at commit efc01466f, which was created from develop commit ee68f7261f2, but does occur after merging commit 2bdd02d38c41 (May 3). It is not yet known which commit between those two introduced the bug.
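For context on the failure mode, here is a hypothetical plain-Python sketch (not the actual MIGraphX source) of why a recursive topological sort like `module::sort()` can recurse without bound: a DFS that visits each instruction's inputs before emitting the instruction terminates on a DAG, but loops forever once a transformation leaves a cycle in the dependency graph.

```python
# Hypothetical illustration, not MIGraphX code: a recursive-DFS topological
# sort with no cycle check. `graph` maps each node to the list of nodes it
# depends on (its "inputs").
def toposort(graph, node, order):
    for dep in graph[node]:          # visit all inputs first...
        toposort(graph, dep, order)
    order.append(node)               # ...then emit this node

# A well-formed (acyclic) dependency graph sorts normally:
acyclic = {"a": ["b"], "b": ["c"], "c": []}
order = []
toposort(acyclic, "a", order)
print(order)  # ['c', 'b', 'a']

# But if a pass accidentally rewires inputs so that c depends on a again,
# the same routine recurses until the stack is exhausted:
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
try:
    toposort(cyclic, "a", [])
except RecursionError:
    print("unbounded recursion, as in module::sort()")
```

If the graph handed to `module::sort()` is cyclic for a similar reason, any recursive visit of instruction inputs would exhibit exactly this behavior.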
Steps to reproduce:
- Set up the DLM performance test environment.
- Check out branch `fix_find_pointwise_reduce`, then merge branch `develop`.
- Run the test script: `python benchmarks/dynamo/torchbench.py --inference --float16 -dcuda --performance --backend migraphx -k speech_transformer`
- Multiple `*.mxr` models are created. Run the MIGraphX driver on the first one to see the failure: `bin/driver compile ../../pytorch/fused_0.mxr`
I have since seen the same failure with the GoogleFnet model. An `.mxr` file created from GoogleFnet is on hyd-7c-ZT09-02.amd.com.
Here's a reduced test case (note the `import migraphx`, which was missing):

```python
import migraphx

p = migraphx.program()
m = p.get_main_module()
x_0 = m.add_literal(migraphx.generate_argument(migraphx.shape(type="float_type", lens=[5, 784, 768]), 0))
x_1 = m.add_literal(migraphx.generate_argument(migraphx.shape(type="float_type", lens=[1]), 1))
p_x = m.add_parameter("x", migraphx.shape(type="float_type", lens=[5, 784, 768]))
x_3 = m.add_instruction(migraphx.op("reduce_mean", axes=[-1]), [p_x])
x_4 = m.add_instruction(migraphx.op("multibroadcast", out_lens=[5, 784, 768]), [x_3])
x_5 = m.add_instruction(migraphx.op("sub"), [p_x, x_4])
x_6 = m.add_instruction(migraphx.op("multibroadcast", out_lens=[5, 784, 768]), [x_1])
x_7 = m.add_instruction(migraphx.op("div"), [x_5, x_6])
x_8 = m.add_instruction(migraphx.op("mul"), [x_7, x_7])
m.add_instruction(migraphx.op("reduce_sum", axes=[-1]), [x_8])
```