
Correctness and Performance in example_gemm_xdl_fp16

Open QingtaoLi1 opened this issue 1 year ago • 7 comments

Problem:

  1. Wrong results when running example_gemm_xdl_fp16.
  2. On one MI250 GPU, I got only ~110 TFlops with the default GEMM problem size, which is lower than I expected. Is this performance in line with your expectations?

Reproduction: MI250 GPU server, Ubuntu 20.04.6 LTS, rocm-libs 5.5.0.50500-63~20.04.

git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
cd composable_kernel
mkdir build && cd build
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx908;gfx90a" \
..
make -j example_gemm_xdl_fp16
./bin/example_gemm_xdl_fp16 1 2 5

QingtaoLi1 avatar Aug 07 '23 08:08 QingtaoLi1

example_gemm_xdl_fp16 is not used for performance tests. For performance tests, please use ckProfiler https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/profiler

For the correctness issue, could you post your logs? Our CI does not capture the issue.

zjing14 avatar Aug 07 '23 15:08 zjing14

@zjing14 Here are the logs:

$ ./bin/example_gemm_xdl_fp16 1 2 5
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
Perf: 0.990551 ms, 130.078 TFlops, 97.3892 GB/s, DeviceGemm_Xdl_CShuffle<256, 256, 128, 32, 8, 8, 32, 32, 4, 2, 8, 8, 1, 1> LoopScheduler: Default, PipelineVersion: v1
Error: Incorrect results! out[0] != ref[0]: -1.685547 != 1349
Error: Incorrect results! out[1] != ref[1]: -1.685547 != -33.4375
Error: Incorrect results! out[2] != ref[2]: -1.685547 != 8.609375
Error: Incorrect results! out[3] != ref[3]: -1.685547 != -3.066406
max err: 1430.686
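For anyone sanity-checking the Perf line, here is a minimal sketch of how those two figures can be reproduced from the problem size and the measured time. It assumes the usual GEMM accounting of 2*M*N*K floating-point operations and that A, B and C are each touched once at 2 bytes per fp16 element. The helper name `gemm_perf` is made up for illustration and is not part of CK.

```python
def gemm_perf(m, n, k, time_ms, bytes_per_elem=2):
    """Return (TFlops, GB/s) for an M x N x K GEMM that took time_ms milliseconds.

    Assumptions (not taken from the CK source): one multiply and one add per
    inner-product term, and A (M x K), B (K x N), C (M x N) each moved once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


# Problem size and time taken from the log above.
tflops, gbps = gemm_perf(3840, 4096, 4096, 0.990551)
print(f"{tflops:.3f} TFlops, {gbps:.3f} GB/s")  # 130.078 TFlops, 97.389 GB/s
```

Under these assumptions the result agrees with the logged 130.078 TFlops and 97.3892 GB/s, so the throughput figure follows directly from the kernel time and does not depend on whether the output is correct.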

Sometimes the "out[i]" may be 0.
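One pattern in the log is that every reported out[i] has the same value. A rough sketch for pulling the mismatch lines out of such a log and checking for that pattern follows. The regular expression is written against the line format shown in this thread only, and `parse_mismatches` is a hypothetical helper, not a CK tool.

```python
import re

# Matches lines such as:
#   Error: Incorrect results! out[0] != ref[0]: -1.685547 != 1349
MISMATCH = re.compile(
    r"out\[(\d+)\] != ref\[\1\]: (-?\d+(?:\.\d+)?) != (-?\d+(?:\.\d+)?)"
)


def parse_mismatches(log_text):
    """Return a list of (index, out_value, ref_value) for each reported mismatch."""
    return [(int(i), float(o), float(r)) for i, o, r in MISMATCH.findall(log_text)]


# The four mismatch lines from the log in this thread.
log_text = """
Error: Incorrect results! out[0] != ref[0]: -1.685547 != 1349
Error: Incorrect results! out[1] != ref[1]: -1.685547 != -33.4375
Error: Incorrect results! out[2] != ref[2]: -1.685547 != 8.609375
Error: Incorrect results! out[3] != ref[3]: -1.685547 != -3.066406
"""

mismatches = parse_mismatches(log_text)
distinct_outputs = {out for _, out, _ in mismatches}
largest_printed_err = max(abs(out - ref) for _, out, ref in mismatches)

# A single distinct output value means every printed element is identical.
print(len(mismatches), distinct_outputs, round(largest_printed_err, 3))
```

Only the first few mismatches are printed by the example, so the largest error computed here covers just those lines and need not equal the "max err" value in the log.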

QingtaoLi1 avatar Aug 08 '23 03:08 QingtaoLi1

I suspect the key factor is the ROCm version or some other part of the environment. The same code behaves differently (in both TFLOPs and correctness) on different machines. The two machines I used came with different versions of ROCm (and other supporting tools) pre-installed.

On the same commit (develop branch, 578142db3a9e2bc273c2a178ca3e550ee79e6050, the latest commit at the time of writing), the ROCm-5.4.0 machine runs example_gemm_xdl_fp16 with correct output but reaches only ~110 TFLOPs, while the ROCm-5.5.0 machine reaches ~150 TFLOPs but produces incorrect output, as in the log I posted above.

However, after I installed ROCm-5.4.0 via amdgpu-install on the ROCm-5.5.0 machine and built CK with:

cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm-5.4.0 \
-D CMAKE_CXX_COMPILER=/opt/rocm-5.4.0/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx908;gfx90a" \
..

the behavior remained unchanged.

QingtaoLi1 avatar Aug 10 '23 08:08 QingtaoLi1

@QingtaoLi1 Could you try the latest ROCm 5.6?

zjing14 avatar Aug 11 '23 15:08 zjing14

Yes, but it will have to wait until I regain access to the machine... :(

Have you observed similar issues when switching between ROCm versions or environments? My hardware is the same and the CK code is the same, so the software environment is probably the cause of this odd gap.


Update: @zjing14 I have confirmed the issue. When compiled with ROCm-5.4.0 (HIP=5.4.22454), 5.5.0 (HIP=5.5.23132) and 5.6.0 (HIP=5.6.23242) on the same machine, example_gemm_xdl_fp16 behaves differently. 5.4.0 and 5.6.0 give correct results at ~100 TFLOPs, while 5.5.0 gives wrong results at ~150 TFLOPs. ckProfiler behaves similarly: with 5.5.0 it reports incorrect output for every instance.

QingtaoLi1 avatar Aug 14 '23 06:08 QingtaoLi1

@QingtaoLi1 Thanks very much. Yes, ROCm 5.5 has some compiler issues.

zjing14 avatar Aug 16 '23 12:08 zjing14

@zjing14 I'm curious whether the issues stem from the ROCm side or the CK side.

QingtaoLi1 avatar Aug 21 '23 03:08 QingtaoLi1

@QingtaoLi1 Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!

ppanchad-amd avatar Aug 21 '24 15:08 ppanchad-amd

@QingtaoLi1 Closing the ticket for now. Please feel free to re-open it if you still encounter the issue with the latest ROCm, and we will investigate further. Thanks!

ppanchad-amd avatar Sep 25 '24 19:09 ppanchad-amd