Fuse convolution + reshape + transpose + sigmoid
@404 = gpu::code_object[code_object=6464,symbol_name=mlir_convolution_add,global=102400,local=256,](@402,@293,@400,@403) -> half_type, {1, 255, 80, 80}, {1632000, 6400, 80, 1}, target_id=0: 0.0226192ms, 1%
@405 = reshape_lazy[dims={1, 3, 85, 80, 80}](@404) -> half_type, {1, 3, 85, 80, 80}, {1632000, 544000, 6400, 80, 1}, target_id=0: 0.00101232ms, 1%
@406 = transpose[permutation={0, 1, 3, 4, 2}](@405) -> half_type, {1, 3, 80, 80, 85}, {1632000, 544000, 80, 1, 6400}, target_id=0: 0.00064004ms, 1%
@407 = load[offset=4233600,end=7497600](@1) -> half_type, {1, 3, 80, 80, 85}, {1632000, 544000, 80, 1, 6400}, target_id=0: 0.00062424ms, 1%
@408 = gpu::code_object[code_object=9096,symbol_name=sigmoid_kernel,global=816000,local=1024,](@406,@407) -> half_type, {1, 3, 80, 80, 85}, {1632000, 544000, 80, 1, 6400}, target_id=0: 0.0194867ms, 1%
This pattern is from the YOLOv5s model. In this case, convolution + add + reshape + transpose + sigmoid can be fused, but they are not currently being fused.
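For anyone who wants to poke at this, here is a rough stand-alone sketch (untested) that builds the same pattern with MIGraphX's C++ module API. The 128 input channels, the weight shape, and the per-channel bias are my assumptions chosen to match the {1, 255, 80, 80} output in the trace; the reshape only becomes reshape_lazy after lowering:

```cpp
#include <migraphx/program.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/shape.hpp>

migraphx::program build_pattern()
{
    migraphx::program p;
    auto* mm = p.get_main_module();
    // Assumed input/weight shapes; only the {1, 255, 80, 80} conv output
    // is confirmed by the trace above.
    auto x = mm->add_parameter(
        "x", migraphx::shape{migraphx::shape::half_type, {1, 128, 80, 80}});
    auto w = mm->add_parameter(
        "w", migraphx::shape{migraphx::shape::half_type, {255, 128, 1, 1}});
    auto b = mm->add_parameter(
        "b", migraphx::shape{migraphx::shape::half_type, {255}});
    auto conv = mm->add_instruction(migraphx::make_op("convolution"), x, w);
    // Broadcast the per-channel bias along the channel axis, then add.
    auto bb = mm->add_instruction(
        migraphx::make_op("broadcast", {{"axis", 1}, {"out_lens", {1, 255, 80, 80}}}), b);
    auto sum = mm->add_instruction(migraphx::make_op("add"), conv, bb);
    auto rsp = mm->add_instruction(
        migraphx::make_op("reshape", {{"dims", {1, 3, 85, 80, 80}}}), sum);
    auto t = mm->add_instruction(
        migraphx::make_op("transpose", {{"permutation", {0, 1, 3, 4, 2}}}), rsp);
    mm->add_instruction(migraphx::make_op("sigmoid"), t);
    return p;
}
```

Compiling this for the gpu target should reproduce the unfused reshape_lazy / transpose / sigmoid sequence shown in the trace above.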
The fuse_mlir pass already does the fusion for conv + pointwise.
https://github.com/ROCm/AMDMIGraphX/blob/6e505cd43c5286f4fafff9b8b863fe94e8735a25/src/targets/gpu/fuse_mlir.cpp#L408
But the fuse_pointwise pass cannot fuse pointwise modules across the transposes, so fuse_mlir ends up fusing only the convolution + add.
https://github.com/ROCm/AMDMIGraphX/blob/6e505cd43c5286f4fafff9b8b863fe94e8735a25/src/fuse_pointwise.cpp#L181
https://github.com/ROCm/AMDMIGraphX/blob/6e505cd43c5286f4fafff9b8b863fe94e8735a25/src/fuse_pointwise.cpp#L202
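To illustrate the limitation, here is a simplified, hypothetical predicate; the real pass uses MIGraphX matchers (see the links above), but the idea is the same: fusion only triggers when a pointwise instruction feeds a pointwise instruction directly:

```cpp
#include <algorithm>
#include <migraphx/instruction.hpp>
#include <migraphx/instruction_ref.hpp>

// Simplified stand-in for the fuse_pointwise match condition: a pointwise
// instruction is fusable only if one of its *direct* inputs is also pointwise.
bool fuses_with_input(migraphx::instruction_ref ins)
{
    if(ins->name() != "pointwise")
        return false;
    const auto& inputs = ins->inputs();
    return std::any_of(inputs.begin(), inputs.end(), [](migraphx::instruction_ref in) {
        // In the YOLOv5s pattern, the sigmoid's input here is the transpose
        // (fed by the reshape), not the add's pointwise module, so this is
        // false and the two pointwise modules are never merged.
        return in->name() == "pointwise";
    });
}
```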
This issue may be the same as https://github.com/ROCm/AMDMIGraphX/issues/2822 or https://github.com/ROCm/AMDMIGraphX/issues/2813.
@pfultz2 Do you think this needs to be handled inside the pointwise fusion rather than the fuse_mlir pass?
https://github.com/ROCm/AMDMIGraphX/pull/3280 doesn't solve this, because the "transpose" sits between two pointwise ops, and fuse_pointwise won't fuse two pointwises across transposes.
Would it somehow be possible to fuse pointwises across transposes? Otherwise, I find it better to run the fuse_mlir pass first and then fuse_pointwise_reduce.
> Would it somehow be possible to fuse pointwises across transposes?
Yes, we need to extend rewrite_reshapes to handle that by updating the axis map with the new permutation.
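If it helps, here is a minimal sketch of the bookkeeping that answer implies, assuming the axis map is a plain vector where axis_map[i] is the output axis that source axis i currently lands on (rewrite_reshapes' actual representation may well differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compose an axis map with a transpose permutation, where output axis j of
// the transpose reads input axis perm[j].
std::vector<std::size_t> update_axis_map(const std::vector<std::size_t>& axis_map,
                                         const std::vector<std::size_t>& perm)
{
    // Invert the permutation: inv[perm[j]] = j.
    std::vector<std::size_t> inv(perm.size());
    for(std::size_t j = 0; j < perm.size(); j++)
        inv[perm[j]] = j;
    // An axis that sat at position a before the transpose sits at inv[a] after.
    std::vector<std::size_t> result(axis_map.size());
    std::transform(axis_map.begin(), axis_map.end(), result.begin(),
                   [&](std::size_t a) { return inv[a]; });
    return result;
}
```

Composing with the inverse permutation is what keeps the map consistent across the transpose: for the {0, 1, 3, 4, 2} permutation in this trace, the old channel-split axis 2 would be remapped to position 4.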
Okay. Is the following the tracking issue? https://github.com/ROCm/AMDMIGraphX/issues/2895