AMDMIGraphX

Remove rocmlir unsupported reduce types

Open dhernandez0 opened this issue 11 months ago • 9 comments

We are going to enable f16 for reduce in this rocmlir PR: https://github.com/ROCm/rocMLIR/pull/1722

However, the code here assumes bf16 and fp8 are supported, which is not the case.
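For context, the fix amounts to shrinking the allowlist that the rocMLIR offload check consults for reduce ops. A minimal sketch of that kind of gate, assuming hypothetical names (this is not the actual MIGraphX code), where f32 and, once rocMLIR PR 1722 lands, f16 are allowed while bf16 and fp8 are rejected:

```cpp
#include <set>
#include <string>

// Hypothetical sketch (not the actual MIGraphX API): a guard that only
// lets rocMLIR handle reduce ops for element types it actually supports.
// Per this issue, f32 and (with rocMLIR PR 1722) f16 are supported, while
// bf16 and fp8 must be excluded until rocMLIR gains support for them.
bool mlir_reduce_type_supported(const std::string& type)
{
    static const std::set<std::string> supported = {"float_type", "half_type"};
    return supported.count(type) > 0;
}
```

The point of keeping this a single predicate is that removing or re-adding a type is a one-line change when rocMLIR support changes.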

dhernandez0 avatar Jan 17 '25 14:01 dhernandez0

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.29%. Comparing base (79f95da) to head (30b200f).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3766   +/-   ##
========================================
  Coverage    92.29%   92.29%           
========================================
  Files          519      519           
  Lines        22233    22233           
========================================
  Hits         20520    20520           
  Misses        1713     1713           

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Jan 17 '25 16:01 codecov[bot]

By the way, reduce f16 should not be enabled for Navi: it does not support atomic_add with packed f16, so it would be slow. Leaving this comment here for whoever is assigned the PR. @causten

dhernandez0 avatar Jan 20 '25 15:01 dhernandez0

What's the status on this PR? Is there additional work needed or is the removal of the allowed types the only thing that needs review?

CharlieL7 avatar Jan 21 '25 17:01 CharlieL7

> What's the status on this PR? Is there additional work needed or is the removal of the allowed types the only thing that needs review?

I thought this was all that needed to be done, so I created the PR. However, we still need some tests, and we need to make sure reduce f16 is not enabled for Navi (except for gfx12). I think @causten will assign somebody from the migraphx team to finish this.

dhernandez0 avatar Jan 21 '25 19:01 dhernandez0

Test | Batch | Rate new (da147a) | Rate old (879f30) | Diff | Compare
torchvision-resnet50 64 3,237.78 3,237.49 0.01% :white_check_mark:
torchvision-resnet50_fp16 64 6,875.40 6,885.53 -0.15% :white_check_mark:
torchvision-densenet121 32 2,438.00 2,438.04 -0.00% :white_check_mark:
torchvision-densenet121_fp16 32 4,204.28 4,201.39 0.07% :white_check_mark:
torchvision-inceptionv3 32 1,615.08 1,614.18 0.06% :white_check_mark:
torchvision-inceptionv3_fp16 32 2,690.32 2,690.16 0.01% :white_check_mark:
cadene-inceptionv4 16 750.31 750.10 0.03% :white_check_mark:
cadene-resnext64x4 16 809.54 809.38 0.02% :white_check_mark:
slim-mobilenet 64 6,667.02 6,664.81 0.03% :white_check_mark:
slim-nasnetalarge 64 199.04 199.11 -0.03% :white_check_mark:
slim-resnet50v2 64 3,425.52 3,426.01 -0.01% :white_check_mark:
bert-mrpc-onnx 8 1,145.52 1,145.27 0.02% :white_check_mark:
bert-mrpc-tf 1 476.91 479.30 -0.50% :white_check_mark:
pytorch-examples-wlang-gru 1 480.10 470.08 2.13% :white_check_mark:
pytorch-examples-wlang-lstm 1 432.17 438.03 -1.34% :white_check_mark:
torchvision-resnet50_1 1 813.07 803.96 1.13% :white_check_mark:
cadene-dpn92_1 1 429.56 433.48 -0.90% :white_check_mark:
cadene-resnext101_1 1 390.30 389.71 0.15% :white_check_mark:
onnx-taau-downsample 1 373.05 373.58 -0.14% :white_check_mark:
dlrm-criteoterabyte 1 31.80 31.80 -0.02% :white_check_mark:
dlrm-criteoterabyte_fp16 1 51.05 51.06 -0.01% :white_check_mark:
agentmodel 1 8,707.08 8,764.93 -0.66% :white_check_mark:
unet_fp16 2 58.00 58.01 -0.02% :white_check_mark:
resnet50v1_fp16 1 1,040.62 1,021.79 1.84% :white_check_mark:
resnet50v1_int8 1 787.04 780.83 0.79% :white_check_mark:
bert_base_cased_fp16 64 1,172.83 1,172.44 0.03% :white_check_mark:
bert_large_uncased_fp16 32 362.44 362.45 -0.00% :white_check_mark:
bert_large_fp16 1 200.26 201.10 -0.42% :white_check_mark:
distilgpt2_fp16 16 2,217.99 2,219.84 -0.08% :white_check_mark:
yolov5s 1 512.25 521.90 -1.85% :white_check_mark:
tinyllama 1 43.63 43.62 0.04% :white_check_mark:
vicuna-fastchat 1 43.86 43.83 0.08% :white_check_mark:
whisper-tiny-encoder 1 412.29 412.03 0.06% :white_check_mark:
whisper-tiny-decoder 1 411.72 410.73 0.24% :white_check_mark:

This build is OK for merge :white_check_mark:

migraphx-bot avatar Jan 30 '25 07:01 migraphx-bot


     :white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
     :white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
     :white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance
     :white_check_mark: unet: PASSED: MIGraphX meets tolerance
     :white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
     :red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output

     :white_check_mark: bert_large: PASSED: MIGraphX meets tolerance
     :white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance
     :white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance
     :white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

migraphx-bot avatar Jan 30 '25 07:01 migraphx-bot

@richagadgil we recently merged (rocmlir) support for bf16 atomic add for navi4 and gfx950, which means we can enable bf16 reductions for those two architectures. To sum up: Navi4 supports f32, bf16 and f16 reductions; Navi3 supports f32 reductions only; I think older Navis don't support reductions at all. MI series: all of them support f32 reductions (gfx906 doesn't, but I don't know if migraphx supports that architecture, as it's old). f16 reductions are supported on gfx908, gfx90a, gfx94x and gfx950. bf16 reductions are gfx950 only.

Maybe we can refactor this in the future into a centralized place, or just add a rocmlir API to figure out whether a reduction is supported? @pfultz2
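The support matrix described above could be encoded as a single predicate, which is roughly what a centralized check or rocmlir API would expose. A sketch under stated assumptions: the function and type names are hypothetical, and the gfx-prefix mapping (Navi3 = gfx11xx, Navi4 = gfx12xx) is an assumption, not something stated in this thread:

```cpp
#include <string>

// Illustrative only: encodes the per-architecture reduce support described
// in the comment above. Not a rocMLIR API; names are assumptions.
bool mlir_reduce_supported(const std::string& arch, const std::string& type)
{
    auto starts_with = [&](const char* p) { return arch.rfind(p, 0) == 0; };
    if(type == "float_type")
        // All MI-series except gfx906, plus Navi3 (gfx11xx) and Navi4 (gfx12xx)
        return arch != "gfx906" &&
               (starts_with("gfx9") || starts_with("gfx11") || starts_with("gfx12"));
    if(type == "half_type")
        // gfx908, gfx90a, gfx94x, gfx950, and Navi4 (packed-f16 atomic_add)
        return arch == "gfx908" || arch == "gfx90a" || starts_with("gfx94") ||
               arch == "gfx950" || starts_with("gfx12");
    if(type == "bf16_type")
        // bf16 atomic add recently merged for Navi4 and gfx950
        return arch == "gfx950" || starts_with("gfx12");
    return false;
}
```

Centralizing the matrix like this (or behind a rocmlir query) would keep MIGraphX from duplicating per-architecture knowledge that can drift out of date.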

dhernandez0 avatar Feb 28 '25 08:02 dhernandez0

@turneram I think this is still needed?

dhernandez0 avatar Mar 07 '25 15:03 dhernandez0

@richagadgil can we reopen this?

dhernandez0 avatar Mar 07 '25 16:03 dhernandez0