FumoTime

Results: 2 issues by FumoTime

When running 03-matrix-multiply, the performance is much lower than rocBLAS:

```
        M       N       K    rocBLAS    Triton
0  1024.0  1024.0  1024.0  21.770480  3.301941
1  2048.0  2048.0  2048.0  25.513268  3.196135
2...
```
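To put a number on "much lower", the two rows visible in the report can be turned into a ratio. This is only a rough sketch: it assumes both columns report throughput in the same unit (higher is better), which the snippet does not state, and the `speedup` helper is a hypothetical name introduced here for illustration.

```python
def speedup(baseline: float, candidate: float) -> float:
    """Return how many times higher the baseline throughput is than the candidate's."""
    if candidate <= 0:
        raise ValueError("candidate throughput must be positive")
    return baseline / candidate


# (M, N, K, rocBLAS, Triton) rows copied from the report above.
# Assumption: both value columns are throughput figures in the same unit.
rows = [
    (1024, 1024, 1024, 21.770480, 3.301941),
    (2048, 2048, 2048, 25.513268, 3.196135),
]

ratios = [speedup(rocblas, triton) for _, _, _, rocblas, triton in rows]

for (m, n, k, _, _), ratio in zip(rows, ratios):
    print(f"{m}x{n}x{k}: rocBLAS / Triton = {ratio:.1f}x")
```

Under that assumption, the gap is roughly 6.6x at size 1024 and 8.0x at size 2048.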

### Problem Description

Trying out torch.compile via torch_migraphx and using the example code in torch_migraphx/examples/dynamo/stable_diffusion (but compiling only the unet) does not seem to give a performance increase. Passing in...