MIOpen icon indicating copy to clipboard operation
MIOpen copied to clipboard

Implement RoIAlign

Open anhskrttt opened this issue 10 months ago • 4 comments

  • Add RoIAlign operation [ref] for both forward and backward.
  • Add driver and gtest.
  • Performance condition:
    • For RoIAlignForward, MIOpen performs better if input and output tensors are contiguous.
      • ROCm pytorch's RoIAlign doesn't support for bfloat16 yet, so this is an always winning case.
    • For RoIAlignBackward, MIOpen performs better if:
      • Input tensor is float16 type
      • (Input tensors are contiguous) or (Input tensors are noncontiguous but input_numel size is smaller)

Average improvement over ROCm

type fwd bwd
float 6.24 0.97
float16 6.32 1.21

Detail Benchmark

fp32_fwd
input_shape rois_shape output_shape is_contiguous spatial_scale sampling_ratio ROCm MIOpen improvement
[16 256 160 160] [6 5] [7 7] TRUE 0.0625 -1 838074 11022 76.04
[6 256 200 320] [6 5] [7 7] TRUE 0.0625 -1 738074 13866 53.23
[16 256 160 160] [6 5] [7 7] TRUE 0.25 2 847994 16337 51.91
[6 256 200 320] [6 5] [7 7] TRUE 0.25 -1 755995 33812 22.36
[16 256 160 160] [6 5] [28 28] TRUE 0.0625 -1 875194 66168 13.23
[6 256 200 320] [6 5] [28 28] TRUE 0.25 -1 780314 76088 10.26
[16 256 160 160] [6 5] [28 28] TRUE 0.25 2 905273 112017 8.08
[6 256 200 320] [6 5] [28 28] TRUE 0.0625 2 797754 111875 7.13
[16 256 160 160] [512 5] [7 7] TRUE 0.0625 -1 1028473 347748 2.96
[6 256 200 320] [512 5] [7 7] TRUE 0.0625 2 1098552 575622 1.91
[16 256 160 160] [512 5] [7 7] TRUE 1 2 1881266 1121320 1.68
[1 1 64 64] [512 5] [7 7] TRUE 0.25 -1 16960 12817 1.32
[6 1 800 1060] [512 5] [7 7] TRUE 0.25 2 20480 14684 1.39
[6 256 200 320] [512 5] [7 7] TRUE 0.25 -1 1757267 1468540 1.20
[1 3 96 96] [6 5] [28 28] TRUE 0.25 -1 12480 10062 1.24
[6 256 200 320] [2149 5] [7 7] TRUE 0.25 2 2980138 2481650 1.20
[1 1 64 64] [512 5] [7 7] TRUE 0.0625 2 17760 14897 1.19
fp16_fwd
input_shape rois_shape output_shape is_contiguous spatial_scale sampling_ratio ROCm MIOpen improvement
[16 256 160 160] [6 5] [7 7] contiguous 0.0625 -1 773275 10631 72.74
[6 256 200 320] [6 5] [7 7] contiguous 0.0625 -1 687035 12906 53.23
[16 256 160 160] [6 5] [7 7] contiguous 0.0625 2 775515 14950 51.87
[16 256 160 160] [6 5] [7 7] contiguous 0.25 2 784474 15430 50.84
[6 256 200 320] [6 5] [7 7] contiguous 0.0625 2 691834 14577 47.46
[16 256 160 160] [6 5] [7 7] contiguous 1 2 777275 18133 42.87
[6 256 200 320] [6 5] [7 7] contiguous 0.25 2 692795 16248 42.64
[16 256 160 160] [6 5] [7 7] contiguous 0.25 -1 779514 21262 36.66
[6 256 200 320] [6 5] [7 7] contiguous 1 2 704475 21741 32.40
[6 256 200 320] [6 5] [7 7] contiguous 0.25 -1 699675 32070 21.82
[16 256 160 160] [6 5] [28 28] contiguous 0.0625 -1 805754 65155 12.37
[16 256 160 160] [6 5] [28 28] contiguous 0.25 -1 806394 65386 12.33
fp16_bwd
input_size rois_size output_size contiguous spatial_scale sampling_ratio ROCm MIOpen Improvement
[3 1 800 1060] [512 5] [28 28] TRUE 0.25 2 3776773 1101540 3.43
[1 1 800 1201] [512 5] [28 28] TRUE 0.25 2 4223650 1260380 3.35
[2 1 800 1060] [512 5] [28 28] TRUE 0.25 2 2998538 1017500 2.95
[1 1 800 1201] [6 5] [28 28] TRUE 0.0625 -1 1359510 475799 2.86
[3 1 800 1060] [512 5] [28 28] TRUE 1 2 1239991 446023 2.78
[1 1 64 64] [6 5] [28 28] FALSE 0.0625 -1 5180762 1944090 2.66
[1 1 64 64] [6 5] [28 28] TRUE 0.0625 -1 4979164 1887840 2.64
[1 1 64 64] [6 5] [28 28] FALSE 0.25 2 4175810 1677920 2.49
[1 1 800 1201] [512 5] [28 28] TRUE 1 2 1037592 455659 2.28
[6 1 800 1060] [2149 5] [28 28] TRUE 0.25 2 7006829 3170180 2.21
[6 256 200 320] [512 5] [28 28] TRUE 1 2 113196296 68001300 1.66
[6 1 800 1060] [512 5] [28 28] TRUE 0.25 2 1637108 993130 1.65
[1 3 96 96] [512 5] [28 28] FALSE 0.25 -1 11071280 6859840 1.61
[2 1 800 1060] [512 5] [28 28] TRUE 1 2 689115 427890 1.61
[1 1 64 64] [2149 5] [28 28] TRUE 0.25 -1 38307082 23917300 1.60
[1 3 96 96] [2149 5] [28 28] TRUE 0.0625 -1 39848670 25024400 1.59
[1 1 64 64] [2149 5] [28 28] FALSE 0.0625 2 14476410 10476100 1.38
[1 3 96 96] [512 5] [7 7] FALSE 1 2 1018780 743758 1.37
[6 1 800 1060] [2149 5] [28 28] TRUE 0.0625 -1 4765470 3529800 1.35
[1 3 96 96] [512 5] [28 28] TRUE 0.0625 -1 5826490 4406120 1.32
[1 1 64 64] [512 5] [28 28] FALSE 0.25 -1 3054872 2344600 1.30

anhskrttt avatar Feb 10 '25 08:02 anhskrttt

@anhskrttt Do we need a changelog entry for this addition?

amd-jnovotny avatar Feb 10 '25 13:02 amd-jnovotny

Do we need a changelog entry for this addition?

@amd-jnovotny I actually couldn't find any MIOpen documentation specifying which types of additions require a changelog entry...? I noticed that similar ops, such as PReLU and GLU, have been included in the changelog here. Based on that, I think this op could also be added.

cc @long10024070 Do you see any issues with adding this op to the changelog?

anhskrttt avatar Feb 10 '25 13:02 anhskrttt

@anhskrttt Do we need a changelog entry for this addition?

@amd-jnovotny I think editing changelog should be totally handled from AMD side. It is not our decision.

long10024070 avatar Feb 10 '25 14:02 long10024070

@long10024070 thanks for the feedback. @BrianHarrisonAMD Should we add an entry to the MIOpen changelog? Maybe as part of a separate PR? If you can add something, I can review it.

amd-jnovotny avatar Feb 11 '25 20:02 amd-jnovotny

MIOpen is moving to the new monorepo setup and all older unmerged PR's are being closed. Please re-open this as part of the new repo if these changes are still needed.

BradPepersAMD avatar Jul 14 '25 06:07 BradPepersAMD