onnxruntime
Enable ROCm to use tunable GEMM
Related PRs #12855 #12856 #12857
Description: Enable ROCm to use tunable GEMM for better performance.
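For readers new to the term: a tunable GEMM generally means timing several candidate implementations for a given problem shape and caching the fastest one for later calls. The sketch below illustrates that general idea in plain Python. It is not the code in this PR; `TunableGemm`, `gemm_naive`, `gemm_transposed_b`, and the `repeats` parameter are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = List[List[float]]
GemmFn = Callable[[Matrix, Matrix], Matrix]


def gemm_naive(a: Matrix, b: Matrix) -> Matrix:
    """Candidate 1 (hypothetical): textbook triple loop."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_transposed_b(a: Matrix, b: Matrix) -> Matrix:
    """Candidate 2 (hypothetical): transpose B once, then take row dot products."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


class TunableGemm:
    """Illustrative only: picks the fastest candidate per (M, K, N) shape."""

    def __init__(self, candidates: Sequence[GemmFn], repeats: int = 3) -> None:
        if not candidates:
            raise ValueError("at least one candidate implementation is required")
        self._candidates = list(candidates)
        self._repeats = repeats
        # Cache of the winning implementation, keyed by problem shape.
        self._best: Dict[Tuple[int, int, int], GemmFn] = {}

    def _tune(self, a: Matrix, b: Matrix) -> GemmFn:
        best_fn, best_time = self._candidates[0], float("inf")
        for fn in self._candidates:
            start = time.perf_counter()
            for _ in range(self._repeats):
                fn(a, b)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_fn, best_time = fn, elapsed
        return best_fn

    def __call__(self, a: Matrix, b: Matrix) -> Matrix:
        key = (len(a), len(b), len(b[0]))
        if key not in self._best:
            # First call for this shape: benchmark once, then reuse the winner.
            self._best[key] = self._tune(a, b)
        return self._best[key](a, b)


tuned_gemm = TunableGemm([gemm_naive, gemm_transposed_b])
result = tuned_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

The tuning cost is paid once per shape, which is why this kind of approach tends to suit workloads that repeat the same GEMM shapes many times, such as inference with fixed batch size and sequence length.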
Motivation and Context
- Why is this change required? What problem does it solve? This drastically improves the performance of some GEMMs and, as a result, the overall performance of BERT inference.
For the record, the performance difference measured with the initial attempt:
| Latency(ms) | Latency_P50 | Latency_P75 | Latency_P90 | Latency_P95 | Latency_P99 | Throughput(QPS) | model | graph_optimization_level | intra_op_num_threads | batch_size | sequence_length | test_cases | test_times | use_gpu |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 113.03 | 113.01 | 113.15 | 113.26 | 113.38 | 113.53 | 9059.37 | fbv_bert_fp16_rocm_no_attention_fusion.onnx | ENABLE_ALL | 24 | 1024 | 128 | 10 | 10 | True |
| 94.89 | 94.88 | 94.92 | 94.96 | 94.98 | 95.02 | 10791.95 | fbv_bert_fp16_rocm_no_attention_fusion.onnx | ENABLE_ALL | 24 | 1024 | 128 | 10 | 10 | True |
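To put the two rows in relative terms, here is a small sketch that computes the change in mean latency and throughput from the Latency(ms) and Throughput(QPS) columns. The rows are not labeled in the table, so the sketch assumes the first row is the run without tunable GEMM and the second is the run with it enabled; the variable names are illustrative.

```python
# Values copied from the Latency(ms) and Throughput(QPS) columns.
# Assumption: row 1 = without tunable GEMM, row 2 = with tunable GEMM.
baseline_latency_ms = 113.03
tuned_latency_ms = 94.89
baseline_qps = 9059.37
tuned_qps = 10791.95

# Fraction of the baseline latency that was removed.
latency_reduction = (baseline_latency_ms - tuned_latency_ms) / baseline_latency_ms

# Fractional increase in throughput over the baseline.
throughput_gain = (tuned_qps - baseline_qps) / baseline_qps

print(f"latency reduction: {latency_reduction:.1%}")
print(f"throughput gain:   {throughput_gain:.1%}")
```

Under that assumption, mean latency drops by roughly 16% and throughput rises by roughly 19%.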
This PR is split into two; the follow-up, #13116, covers enabling the feature and testing it.