AMDMIGraphX
Add GPU onnx support for com.microsoft.SparseAttention
- Should be reviewed after https://github.com/ROCm/AMDMIGraphX/pull/3866 is merged.
- Partially resolves https://github.com/migraphx-benchmark/AMDMIGraphX/issues/200
- Adds GPU support for SparseAttention using simple handmade kernels; these will later be replaced by an implementation built from a composite of existing, MLIR-compatible operators (a brief Python usage sketch follows below).
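For anyone who wants to try the new path from Python, here is a minimal, hedged sketch using the MIGraphX Python API (`parse_onnx`, `compile`, `get_parameter_shapes`); the model file name is a placeholder and is not taken from this PR's tests:

```python
# Minimal sketch, not taken from this PR: parse an ONNX model that contains a
# com.microsoft.SparseAttention node and compile it for the GPU target, which
# now lowers the op through the handmade kernels added here.
import migraphx

prog = migraphx.parse_onnx("sparse_attention_model.onnx")  # placeholder file name
prog.compile(migraphx.get_target("gpu"))

# Listing the program parameters shows the operator's named inputs
# (query/key/value, past key/value cache, block-sparse layout indices and
# sequence lengths) that a caller would bind by name in prog.run({...}).
for name, shape in prog.get_parameter_shapes().items():
    print(name, shape)
```

Once the kernels are swapped for the planned composite of MLIR-compatible operators, the same call sequence should keep working unchanged, since the lowering happens entirely inside `compile`.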
Codecov Report
Attention: Patch coverage is 97.98271% with 7 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/include/migraphx/op/sparse_attention.hpp | 97.51% | 6 Missing :warning: |
| src/onnx/parse_sparse_attention.cpp | 99.06% | 1 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           develop    #3938     +/-   ##
===========================================
- Coverage    92.41%   92.11%    -0.30%
===========================================
  Files          520      527        +7
  Lines        22471    24492     +2021
===========================================
+ Hits         20766    22560     +1794
- Misses        1705     1932      +227
```
| Files with missing lines | Coverage Δ |
|---|---|
| src/onnx/parse_sparse_attention.cpp | 99.06% <99.06%> (ø) |
| src/include/migraphx/op/sparse_attention.hpp | 97.51% <97.51%> (ø) |
| Test | Batch | Rate new ef52b6 | Rate old b4ba1c | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 3,252.02 | 3,237.47 | 0.45% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,896.83 | 6,902.15 | -0.08% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,443.44 | 2,446.09 | -0.11% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,191.72 | 4,216.06 | -0.58% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,622.42 | 1,618.57 | 0.24% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,698.87 | 2,708.65 | -0.36% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 753.10 | 756.80 | -0.49% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 814.31 | 814.70 | -0.05% | :white_check_mark: |
| slim-mobilenet | 64 | 6,685.83 | 7,429.71 | -10.01% | :red_circle: |
| slim-nasnetalarge | 64 | 197.68 | 216.83 | -8.83% | :red_circle: |
| slim-resnet50v2 | 64 | 3,443.84 | 3,439.57 | 0.12% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,149.93 | 1,138.89 | 0.97% | :white_check_mark: |
| bert-mrpc-tf | 1 | 487.04 | 455.55 | 6.91% | :high_brightness: |
| pytorch-examples-wlang-gru | 1 | 479.23 | 482.14 | -0.60% | :white_check_mark: |
| pytorch-examples-wlang-lstm | 1 | 450.68 | 440.04 | 2.42% | :white_check_mark: |
| torchvision-resnet50_1 | 1 | 811.73 | 807.21 | 0.56% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 430.18 | 423.90 | 1.48% | :white_check_mark: |
| cadene-resnext101_1 | 1 | 393.02 | 391.91 | 0.28% | :white_check_mark: |
| onnx-taau-downsample | 1 | 372.15 | 393.08 | -5.32% | :red_circle: |
| dlrm-criteoterabyte | 1 | 31.92 | 32.20 | -0.85% | :white_check_mark: |
| dlrm-criteoterabyte_fp16 | 1 | 51.13 | 51.28 | -0.30% | :white_check_mark: |
| agentmodel | 1 | 8,769.70 | 9,435.83 | -7.06% | :red_circle: |
| unet_fp16 | 2 | 58.42 | 58.49 | -0.12% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 1,044.26 | 1,078.85 | -3.21% | :red_circle: |
| resnet50v1_int8 | 1 | 823.75 | 1,064.61 | -22.62% | :red_circle: |
| bert_base_cased_fp16 | 64 | 1,171.04 | 1,163.36 | 0.66% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 363.39 | 354.52 | 2.50% | :white_check_mark: |
| bert_large_fp16 | 1 | 201.29 | 194.65 | 3.41% | :high_brightness: |
| distilgpt2_fp16 | 16 | 2,224.08 | 2,218.94 | 0.23% | :white_check_mark: |
| yolov5s | 1 | 520.38 | 536.97 | -3.09% | :red_circle: |
| tinyllama | 1 | 43.75 | 43.62 | 0.31% | :white_check_mark: |
| vicuna-fastchat | 1 | 44.10 | 43.96 | 0.30% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 413.04 | 420.49 | -1.77% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 412.21 | 410.31 | 0.46% | :white_check_mark: |
| llama2_7b | 1 | nan | nan | nan% | :x: |
| qwen1.5-7b | 1 | 23.55 | 23.39 | 0.67% | :white_check_mark: |
| phi3-3.8b | 1 | nan | nan | nan% | :x: |
| mask-rcnn | 1 | 18.54 | 18.51 | 0.20% | :white_check_mark: |
| llama3-8b | 1 | 21.23 | 21.64 | -1.89% | :white_check_mark: |
| whisper-large-encoder | 1 | 10.22 | 10.18 | 0.40% | :white_check_mark: |
| whisper-large-decoder | 1 | 92.53 | 97.70 | -5.29% | :red_circle: |
| mistral-7b | 1 | 23.77 | 23.64 | 0.54% | :white_check_mark: |
| FLUX.1-schnell | 1 | 840.74 | 901.27 | -6.72% | :red_circle: |
This build is not recommended to merge :red_circle:
:x:bert-mrpc-tf: ERROR - check error output
2025-04-16 11:23:45.578795: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1744820630.942813 153309 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 62974 MB memory: -> device: 0, name: AMD Instinct MI250X/MI250, pci bus id: 0000:b3:00.0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1744820631.793843 153309 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2025-04-16 11:24:01.166645: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-04-16 11:24:01.166814: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-04-16 11:24:01.166843: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-04-16 11:24:01.166889: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-04-16 11:24:01.166938: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-04-16 11:24:01.166984: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-04-16 11:24:01.167034: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
2025-04-16 11:24:01.167064: E external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:250] bitcode module is required by this HLO module but was not found at ./opencl.bc
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
error: Failure when generating HSACO
2025-04-16 11:24:01.168436: E tensorflow/compiler/mlir/tools/kernel_gen/tf_framework_c_interface.cc:228] INTERNAL: Generating device code failed.
2025-04-16 11:24:01.169605: W tensorflow/core/framework/op_kernel.cc:1829] UNKNOWN: JIT compilation failed.
2025-04-16 11:24:01.169624: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
2025-04-16 11:24:01.169635: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
2025-04-16 11:24:01.169649: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 11217777527359497193
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1407, in _do_call
return fn(*args)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1390, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1483, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
(1) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 324, in main
y_out = sess.run(y, feed_dict=tf_dict)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 977, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1220, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1400, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/client/session.py", line 1426, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
Detected at node 'import/bert/embeddings/LayerNorm/moments/SquaredDifference' defined at (most recent call last):
Node: 'import/bert/embeddings/LayerNorm/moments/SquaredDifference'
Detected at node 'import/bert/embeddings/LayerNorm/moments/SquaredDifference' defined at (most recent call last):
Node: 'import/bert/embeddings/LayerNorm/moments/SquaredDifference'
2 root error(s) found.
(0) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
[[import/loss/output/_21]]
(1) UNKNOWN: JIT compilation failed.
[[{{node import/bert/embeddings/LayerNorm/moments/SquaredDifference}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'import/bert/embeddings/LayerNorm/moments/SquaredDifference':
:red_circle:bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
:x:llama2_7b: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:264: parse_from: PARSE_FROM: Failed reading onnx file: /new-saved-models/llama2_7b/decoder_model.onnx
:x:qwen1.5-7b: ERROR - check error output
usage: accuracy_checker.py [-h] [--onnx ONNX] [--tf TF] [--provider PROVIDER]
[--batch BATCH] [--fill1] [--fill0] [--fp16]
[--argmax] [--verbose] [--tolerance TOLERANCE]
[--input-dim INPUT_DIM] [--target TARGET]
[--ort-run] [--ort-logging]
[--disable-offload-copy] [--disable-fast-math]
[--exhaustive_tune]
accuracy_checker.py: error: unrecognized arguments: input_ids attention_mask position_ids 1 256 @attention_mask 1 256 @position_ids 1 256
:x:phi3-3.8b: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:264: parse_from: PARSE_FROM: Failed reading onnx file: /new-saved-models/phi3-3.8b/model.onnx
:red_circle:mask-rcnn: FAILED: MIGraphX is not within tolerance - check verbose output
:x:whisper-large-encoder: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/include/migraphx/op/convolution.hpp:100: normalize_compute_shape: CONVOLUTION: mismatched channel numbers