AMDMIGraphX
[Issue]: ~20% perf difference between MIGraphX and rocblas-bench
Problem Description
Given this program:
# test.py
import migraphx

p = migraphx.program()
m = p.get_main_module()
p_a = m.add_parameter("inputA", migraphx.shape(type="half_type", lens=[2, 1024, 1280]))
p_b = m.add_parameter("inputB", migraphx.shape(type="half_type", lens=[2, 1280, 10240]))
p_c = m.add_parameter("inputC", migraphx.shape(type="half_type", lens=[2, 1024, 10240]))
p_dot = m.add_instruction(migraphx.op("dot"), [p_a, p_b])
m.add_instruction(migraphx.op("add"), [p_dot, p_c])
I compile this with:
./bin/migraphx-driver compile --exhaustive-tune test.py -o test.mxr
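(For reference, the same compile-and-save step can be done from the Python API. A minimal sketch, assuming migraphx.get_target() and migraphx.save() behave as in the Python examples shipped with MIGraphX; the driver's --exhaustive-tune flag may have no equivalent keyword in older bindings, so it is omitted here:)
# compile_test.py
import migraphx

p = migraphx.program()
m = p.get_main_module()
a = m.add_parameter("inputA", migraphx.shape(type="half_type", lens=[2, 1024, 1280]))
b = m.add_parameter("inputB", migraphx.shape(type="half_type", lens=[2, 1280, 10240]))
c = m.add_parameter("inputC", migraphx.shape(type="half_type", lens=[2, 1024, 10240]))
d = m.add_instruction(migraphx.op("dot"), [a, b])
m.add_instruction(migraphx.op("add"), [d, c])

p.compile(migraphx.get_target("gpu"))  # default tuning; no exhaustive tune here
migraphx.save(p, "test.mxr")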
And I run it with:
ROCBLAS_LAYER=2 rocprof --stats ./bin/migraphx-driver run test.mxr
The timing I get is ~126 us.
If I run perf instead:
ROCBLAS_LAYER=2 rocprof --stats ./bin/migraphx-driver perf test.mxr
I get a ~110 us timing.
Both perf and run will also emit the rocblas-bench command to run the underlying GEMM in rocBLAS (note that the fused add of inputC shows up as beta 1):
./rocblas-bench -f gemm_strided_batched_ex --transposeA N --transposeB N -m 10240 -n 1024 -k 1280 --alpha 1 --a_type f16_r --lda 10240 --stride_a 13107200 --b_type f16_r --ldb 1280 --stride_b 1310720 --beta 1 --c_type f16_r --ldc 10240 --stride_c 10485760 --d_type f16_r --ldd 10240 --stride_d 10485760 --batch_count 2 --compute_type f32_r --algo 1 --solution_index 80268 --flags 0
But if I execute this, I get 145 us (>20% difference). However, if I add the flag --initialization rand_int to rocblas-bench, I get 111 us.
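(The data dependence can also be reproduced against the compiled test.mxr from Python. A minimal sketch, assuming migraphx.load() and migraphx.argument() accept an MXR file and numpy arrays as in the shipped examples, and that a small integer range approximates rocblas-bench's rand_int; host-side timing like this only approximates the kernel times rocprof reports:)
# bench_init.py
import time
import numpy as np
import migraphx

p = migraphx.load("test.mxr")

def make_args(rand_int):
    args = {}
    for name, shape in p.get_parameter_shapes().items():
        if rand_int:
            # roughly what --initialization rand_int does: small integer
            # values, exactly representable in fp16 (range is an assumption)
            data = np.random.randint(1, 4, size=shape.lens()).astype(np.float16)
        else:
            data = np.random.uniform(-1.0, 1.0, size=shape.lens()).astype(np.float16)
        args[name] = migraphx.argument(data)
    return args

for rand_int in (True, False):
    args = make_args(rand_int)
    p.run(args)  # warm-up; the first execution pays one-time costs
    start = time.perf_counter()
    for _ in range(100):
        p.run(args)
    per_iter = (time.perf_counter() - start) / 100
    print(f"rand_int={rand_int}: {per_iter * 1e6:.1f} us/iter (host-side)")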
Operating System
Ubuntu 20.04.6 LTS
CPU
AMD EPYC 7A53 64-Core Processor
GPU
AMD Instinct MI300X
Other
No response
ROCm Version
ROCm 6.0.0
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Running the two following commands on an MI300X (host CPU: Intel(R) Xeon(R) Platinum 8480C), the two results are fairly close:
ROCBLAS_LAYER=2 rocprof --stats ./bin/migraphx-driver run test.mxr
(rocprof AverageNs: ~194 us)
ROCBLAS_LAYER=2 rocprof --stats ./bin/migraphx-driver perf -n 1 test.mxr
(rocprof AverageNs: ~188 us -- this is actually an average of 3 runs.)
It's a bit strange that we see different timings from MIGraphX. Did you set HIP_FORCE_DEV_KERNARG=1? Also, are you using ROCm 6.1 (rocm-6.1.0-445)?
(Those earlier numbers were not with 6.1. The flag is set: HIP_FORCE_DEV_KERNARG=1.)
I see that comparing run with perf is likely to result in some confusion. The reason is that perf -n 1 actually runs at least 3 times and then gives its results, and perf (which by default runs 100+ times) gives similar, better-averaged numbers. The first run is always the slowest, since it is just warming up, while run executes only once and so naturally includes all of the warm-up delays in its accounting. A look at results.csv will show the individual run details of a perf command, and its first run shows virtually the same numbers as a run.
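(To make that concrete, here is a small sketch that splits the first dispatch of each kernel in rocprof's results.csv from the rest; the KernelName/BeginNs/EndNs column names are assumptions based on common rocprof output and may differ between versions:)
# first_vs_rest.py
import csv
from collections import defaultdict

durations = defaultdict(list)  # kernel name -> per-dispatch durations in ns
with open("results.csv") as f:
    for row in csv.DictReader(f):
        durations[row["KernelName"]].append(int(row["EndNs"]) - int(row["BeginNs"]))

for name, times in durations.items():
    if len(times) > 1:
        first, rest = times[0], times[1:]
        print(f"{name[:60]}  first: {first / 1e3:.1f} us  "
              f"avg of rest: {sum(rest) / len(rest) / 1e3:.1f} us")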
Now I have just moved on to ROCm 6.0.3 (from 6.0.2). Current test sample results: perf -n 10 has its first run at 187 us and an AverageNs of 167 us, while run yielded 187 us -- within 0.5 us of the first run of perf!
Running on ROCm 6.0.3, rocBLAS version 4.0.0.88df9726-dirty. On the perf step above, the following command was captured and run:
rocblas-bench --initialization rand_int -f gemm_strided_batched_ex --transposeA N --transposeB N -m 10240 -n 1024 -k 1280 --alpha 1 --a_type f16_r --lda 10240 --stride_a 13107200 --b_type f16_r --ldb 1280 --stride_b 1310720 --beta 1 --c_type f16_r --ldc 10240 --stride_c 10485760 --d_type f16_r --ldd 10240 --stride_d 10485760 --batch_count 2 --compute_type f32_r --algo 1 --solution_index 40501 --flags 0
Query device success: there are 1 devices
Device ID 0 : gfx942:sramecc+:xnack-
with 206.1 GB memory, max. SCLK 2100 MHz, max. MCLK 1300 MHz, compute capability 9.4
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
rocBLAS info: Time taken to complete rocBLAS library initialization is 1120 milliseconds.
rocBLAS info: maximum library size per device is 0.455082 GB.
transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,batch_count,rocblas-Gflops,us
N,N,10240,1024,1280,1,10240,13107200,1,1280,1310720,10240,10485760,10240,10485760,2, 303317, 177
Results:
- rocblas-bench with rand_int initialization: 177 us
- rocblas-bench without rand_int initialization: 212 us
- MIGraphX with the perf command: 168 us
With a changed random number initialization in MIGraphX -- the previous range was very small -- the migraphx perf number moves up from approx. 168 us to 186 us. The new results are as follows:
- rocblas-bench with rand_int initialization: 177 us
- rocblas-bench without rand_int initialization: 212 us
- MIGraphX with the perf command: 186 us
Also, just to compare an FP32 equivalent model, with the changed random number range in MIGraphX:
- rocblas-bench FP32 with rand_int initialization: 492 us
- rocblas-bench FP32 without rand_int initialization: 605 us
- MIGraphX FP32 with the perf command: 516 us
The FP32 rocblas_bench command:
# rocblas-bench -f gemm_strided_batched_ex --transposeA N --transposeB N -m 10240 -n 1024 -k 1280 --alpha 1 --a_type f32_r --lda 10240 --stride_a 13107200 --b_type f32_r --ldb 1280 --stride_b 1310720 --beta 1 --c_type f32_r --ldc 10240 --stride_c 10485760 --d_type f32_r --ldd 10240 --stride_d 10485760 --batch_count 2 --compute_type f32_r --algo 1 --solution_index 41916 --flags 0
Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : gfx942:sramecc+:xnack-
with 206.1 GB memory, max. SCLK 2100 MHz, max. MCLK 1300 MHz, compute capability 9.4
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------
rocBLAS info: Time taken to complete rocBLAS library initialization is 519 milliseconds.
rocBLAS info: maximum library size per device is 0.455082 GB.
transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,batch_count,rocblas-Gflops,us
N,N,10240,1024,1280,1,10240,13107200,1,1280,1310720,10240,10485760,10240,10485760,2, 88739, 605