[Issue]: ~20% perf difference between MIGraphX and rocblas-bench

Open giuseros opened this issue 10 months ago • 7 comments

Problem Description

Given this program:

# test.py
import migraphx

p = migraphx.program()
m = p.get_main_module()
# Batched half-precision inputs: A is [2,1024,1280], B is [2,1280,10240], C is [2,1024,10240]
p_a = m.add_parameter("inputA",migraphx.shape(type="half_type", lens=[2,1024,1280]))
p_b = m.add_parameter("inputB",migraphx.shape(type="half_type", lens=[2,1280,10240]))
p_c = m.add_parameter("inputC",migraphx.shape(type="half_type", lens=[2,1024,10240]))
# dot(A, B) followed by an element-wise add of C
p_dot = m.add_instruction(migraphx.op("dot"), [p_a, p_b])
m.add_instruction(migraphx.op("add"), [p_dot, p_c])

I compile this with:

./bin/migraphx-driver compile --exhaustive-tune  test.py -o test.mxr

And I run with

ROCBLAS_LAYER=2 rocprof --stats  ./bin/migraphx-driver run test.mxr

And the timing I get is ~126us.

If I run perf instead:

ROCBLAS_LAYER=2 rocprof --stats  ./bin/migraphx-driver perf test.mxr

I get a ~110us timing.

Both perf/run will also emit the rocblas-bench command to run the underlying gemm in rocBLAS:

./rocblas-bench -f gemm_strided_batched_ex --transposeA N --transposeB N -m 10240 -n 1024 -k 1280 --alpha 1 --a_type f16_r --lda 10240 --stride_a 13107200 --b_type f16_r --ldb 1280 --stride_b 1310720 --beta 1 --c_type f16_r --ldc 10240 --stride_c 10485760 --d_type f16_r --ldd 10240 --stride_d 10485760 --batch_count 2 --compute_type f32_r --algo 1 --solution_index 80268 --flags 0

But if I execute this, I get 145us (>20% difference). However, if I add the flag --initialization rand_int to rocblas-bench I get 111us.
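The arguments in the emitted rocblas-bench command can be cross-checked against the shapes in test.py. Below is a minimal sketch, assuming the usual operand swap applied when a row-major batched GEMM is handed to a column-major BLAS (the thread does not state this explicitly). `rocblas_args` is a hypothetical helper written for illustration, not part of MIGraphX or rocBLAS.

```python
def rocblas_args(batch, rows_a, inner, cols_b):
    """Map a row-major batched GEMM C[b] = A[b] @ B[b] + C[b] onto
    column-major strided-batched GEMM arguments.

    Assumption: the operands are swapped (C^T = B^T @ A^T), so the
    row-major B plays the role of rocBLAS "A" and the row-major A
    plays the role of rocBLAS "B".
    """
    return {
        "m": cols_b,                  # rows of the column-major result
        "n": rows_a,                  # columns of the column-major result
        "k": inner,                   # shared inner dimension
        "lda": cols_b,
        "stride_a": inner * cols_b,   # elements per batch of row-major B
        "ldb": inner,
        "stride_b": rows_a * inner,   # elements per batch of row-major A
        "ldc": cols_b,
        "stride_c": rows_a * cols_b,  # elements per batch of C
        "batch_count": batch,
    }


# Shapes from test.py: A is [2,1024,1280], B is [2,1280,10240].
args = rocblas_args(batch=2, rows_a=1024, inner=1280, cols_b=10240)
for name, value in args.items():
    print(f"--{name} {value}")
```

Under that assumption, the derived values agree with the -m, -n, -k, --lda, --ldb, --ldc and --stride_* values in the command above.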

Operating System

Ubuntu 20.04.6 LTS

CPU

AMD EPYC 7A53 64-Core Processor

GPU

AMD Instinct MI300X

Other

No response

ROCm Version

ROCm 6.0.0

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

giuseros avatar Mar 27 '24 16:03 giuseros

Running the following two commands, the results are fairly close (MI300X and Intel(R) Xeon(R) Platinum 8480C):

ROCBLAS_LAYER=2 rocprof --stats ./bin/migraphx-driver run test.mxr (AverageNs: 194 us)

ROCBLAS_LAYER=2 rocprof --stats ./bin/migraphx-driver perf -n 1 test.mxr (AverageNs: 188 us; this is actually an average of 3 runs.)

lakhinderwalia avatar Mar 27 '24 21:03 lakhinderwalia

It's a bit strange that we see different timings from MIGraphX. Did you set HIP_FORCE_DEV_KERNARG=1? Also, are you using ROCm 6.1 (rocm-6.1.0-445)?

giuseros avatar Mar 28 '24 09:03 giuseros

(Those earlier numbers were not with 6.1. The flag is set as HIP_FORCE_DEV_KERNARG=1.)

I see that comparing run with perf is likely to cause some confusion. The reason is that perf -n 1 runs at least 3 times and then gives its results. Plain perf, which by default runs 100+ times, gives similar (and better averaged) results. Its first run is always the slowest, because it includes the warm-up.

By contrast, 'run' executes just once, so its timing naturally includes all of the warm-up delays.

A look at results.csv shows the individual run details of a perf command, and its first run shows virtually the same numbers as a 'run'.

I have now moved from ROCm 6.0.2 to ROCm 6.0.3.

Current test sample results: perf -n 10 has its first run at 187 us and an AverageNs of 167 us, while run yielded 187 us. That is within 0.5 us of the first run of perf!
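The warm-up effect described above can be illustrated with a minimal sketch. The per-run timings below are hypothetical values chosen for illustration, not measurements from this thread, and `summarize` is a hypothetical helper. Whether perf's reported average includes the first iteration is not stated here, so both averages are shown.

```python
from statistics import mean


def summarize(durations_us):
    """Separate the first (warm-up) run from the steady-state runs."""
    return {
        "first_run_us": durations_us[0],
        "mean_all_us": mean(durations_us),
        "mean_steady_us": mean(durations_us[1:]),
    }


# Hypothetical per-run kernel times in microseconds (not measured data).
sample = [187.0, 165.0, 164.0, 166.0, 165.0]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

A single 'run' corresponds to the first entry only, which is why it looks slower than an average taken over many perf iterations.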

lakhinderwalia avatar Mar 28 '24 23:03 lakhinderwalia

Running on rocm 6.0.3. rocBLAS version: 4.0.0.88df9726-dirty

  1. During the perf step above, the following command was captured from the output and then run:
rocblas-bench --initialization rand_int -f gemm_strided_batched_ex --transposeA N --transposeB N -m 10240 -n 1024 -k 1280 --alpha 1 --a_type f16_r --lda 10240 --stride_a 13107200 --b_type f16_r --ldb 1280 --stride_b 1310720 --beta 1 --c_type f16_r --ldc 10240 --stride_c 10485760 --d_type f16_r --ldd 10240 --stride_d 10485760 --batch_count 2 --compute_type f32_r --algo 1 --solution_index 40501 --flags 0
Query device success: there are 1 devices
 Device ID 0 :  gfx942:sramecc+:xnack-
with 206.1 GB memory, max. SCLK 2100 MHz, max. MCLK 1300 MHz, compute capability 9.4
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64

rocBLAS info: Time taken to complete rocBLAS library initialization is 1120 milliseconds.

rocBLAS info: maximum library size per device is 0.455082 GB.
transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,batch_count,rocblas-Gflops,us
N,N,10240,1024,1280,1,10240,13107200,1,1280,1310720,10240,10485760,10240,10485760,2, 303317, 177

Results:
With rand_int initialization: 177 us
Without rand_int initialization: 212 us
And with the perf command: 168 us
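As a sanity check on the rocblas-bench output, the rocblas-Gflops column can be reproduced from the reported time. This is a minimal sketch, assuming the common convention of counting 2*m*n*k floating-point operations per batch; `gemm_gflops` is a hypothetical helper, not a rocBLAS function.

```python
def gemm_gflops(m, n, k, batch_count, time_us):
    """GFLOP/s for a batched GEMM, counting 2*m*n*k operations per batch."""
    flops = 2 * m * n * k * batch_count
    seconds = time_us * 1e-6
    return flops / seconds / 1e9


# FP16 run with rand_int initialization: 177 us.
print(round(gemm_gflops(10240, 1024, 1280, 2, 177)))  # prints 303317
```

The result agrees with the rocblas-Gflops value reported next to the 177 us timing, which suggests the two columns are derived from each other under that convention.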

lakhinderwalia avatar Apr 04 '24 00:04 lakhinderwalia

With a changed random number initialization (the previous range was very small), the MIGraphX perf number moves up from approx. 168 us to 186 us. The new results are as follows:

rocblas-bench:
With rand_int initialization: 177 us
Without rand_int initialization: 212 us

MIGraphX:
With the perf command: 186 us

lakhinderwalia avatar Apr 04 '24 20:04 lakhinderwalia

Also, just to compare an FP32 equivalent model, with the changed random number range in MIGraphX:

rocblas-bench:
FP32 with rand_int initialization: 492 us
FP32 without rand_int initialization: 605 us

MIGraphX:
FP32 with the perf command: 516 us
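To put the initialization effect in relative terms, here is a minimal sketch that computes how much slower rocblas-bench is without rand_int initialization, using only the timings reported in this thread. `slowdown_pct` is a hypothetical helper.

```python
def slowdown_pct(baseline_us, other_us):
    """Percentage by which other_us exceeds baseline_us."""
    return (other_us - baseline_us) / baseline_us * 100.0


# (with rand_int, without rand_int) in microseconds, from the results above.
timings = {"FP16": (177, 212), "FP32": (492, 605)}
for dtype, (rand_int_us, default_us) in timings.items():
    print(f"{dtype}: {slowdown_pct(rand_int_us, default_us):.1f}% slower without rand_int")
```

Both data types land at roughly 20% or slightly above, which is in line with the gap described in the issue title.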

lakhinderwalia avatar Apr 04 '24 20:04 lakhinderwalia

The FP32 rocblas-bench command:

# rocblas-bench -f gemm_strided_batched_ex --transposeA N --transposeB N -m 10240 -n 1024 -k 1280 --alpha 1 --a_type f32_r --lda 10240 --stride_a 13107200 --b_type f32_r --ldb 1280 --stride_b 1310720 --beta 1 --c_type f32_r --ldc 10240 --stride_c 10485760 --d_type f32_r --ldd 10240 --stride_d 10485760 --batch_count 2 --compute_type f32_r --algo 1 --solution_index 41916 --flags 0
Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 :  gfx942:sramecc+:xnack-
with 206.1 GB memory, max. SCLK 2100 MHz, max. MCLK 1300 MHz, compute capability 9.4
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 64
-------------------------------------------------------------------------------


rocBLAS info: Time taken to complete rocBLAS library initialization is 519 milliseconds.

rocBLAS info: maximum library size per device is 0.455082 GB.
transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,batch_count,rocblas-Gflops,us
N,N,10240,1024,1280,1,10240,13107200,1,1280,1310720,10240,10485760,10240,10485760,2, 88739, 605

lakhinderwalia avatar Apr 04 '24 20:04 lakhinderwalia