composable_kernel
Correctness and Performance in example_gemm_xdl_fp16
Problem:
- Wrong results when running example_gemm_xdl_fp16.
- On one MI250 GPU, I got only ~110 TFlops using the default GEMM problem size, which is lower than expected. Does this performance meet your expectation?
Reproduction: MI250 GPU server, Ubuntu 20.04.6 LTS, rocm-libs 5.5.0.50500-63~20.04.
git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
cd composable_kernel
mkdir build && cd build
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx908;gfx90a" \
..
make -j example_gemm_xdl_fp16
./bin/example_gemm_xdl_fp16 1 2 5
example_gemm_xdl_fp16 is not intended for performance tests.
For performance tests, please use ckProfiler:
https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/profiler
For the correctness issue, could you post your logs? Our CI does not capture the issue.
@zjing14 Here are the logs:
$ ./bin/example_gemm_xdl_fp16 1 2 5
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
Perf: 0.990551 ms, 130.078 TFlops, 97.3892 GB/s, DeviceGemm_Xdl_CShuffle<256, 256, 128, 32, 8, 8, 32, 32, 4, 2, 8, 8, 1, 1> LoopScheduler: Default, PipelineVersion: v1
Error: Incorrect results! out[0] != ref[0]: -1.685547 != 1349
Error: Incorrect results! out[1] != ref[1]: -1.685547 != -33.4375
Error: Incorrect results! out[2] != ref[2]: -1.685547 != 8.609375
Error: Incorrect results! out[3] != ref[3]: -1.685547 != -3.066406
max err: 1430.686
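As a sanity check on the Perf line in the log above, the reported throughput can be reproduced from the problem size and the kernel time. The sketch below assumes the usual GEMM accounting: 2*M*N*K floating-point operations, and one read each of A and B plus one write of C at 2 bytes per fp16 element. The function name `gemm_perf` is hypothetical and the code is not taken from the CK source.

```python
def gemm_perf(m, n, k, time_ms, bytes_per_elem=2):
    """Return (TFlops, GB/s) for an M x N x K GEMM that ran in time_ms milliseconds.

    Assumes 2*M*N*K flops and a single pass over A (M x K), B (K x N), and C (M x N).
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    seconds = time_ms * 1e-3
    tflops = flops / seconds / 1e12
    gb_per_s = bytes_moved / seconds / 1e9
    return tflops, gb_per_s


# Problem size and time taken from the log: M=3840, N=4096, K=4096, 0.990551 ms.
tflops, gb_per_s = gemm_perf(3840, 4096, 4096, 0.990551)
print(f"{tflops:.3f} TFlops, {gb_per_s:.4f} GB/s")
```

Under these assumptions the result agrees with the 130.078 TFlops and 97.3892 GB/s in the log, so the Perf figures follow directly from the measured kernel time.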
In some runs, the out[i] values are 0 instead.
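The error lines in the log look like an elementwise comparison of the device output against a host reference. A minimal sketch of such a check, using the four value pairs printed in the log, is shown here. The helper name `find_mismatches` and the tolerances are assumptions for illustration, not CK's actual thresholds.

```python
def find_mismatches(out, ref, rtol=1e-3, atol=1e-3):
    """Compare out against ref elementwise.

    Returns a list of (index, out_value, ref_value) for elements outside the
    tolerance, together with the maximum absolute error over all elements.
    The tolerances are illustrative assumptions.
    """
    mismatches = []
    max_err = 0.0
    for i, (o, r) in enumerate(zip(out, ref)):
        err = abs(o - r)
        max_err = max(max_err, err)
        if err > atol + rtol * abs(r):
            mismatches.append((i, o, r))
    return mismatches, max_err


# First four elements as printed in the log: every out value is identical.
out = [-1.685547, -1.685547, -1.685547, -1.685547]
ref = [1349.0, -33.4375, 8.609375, -3.066406]

mismatches, max_err = find_mismatches(out, ref)
for i, o, r in mismatches:
    print(f"out[{i}] != ref[{i}]: {o} != {r}")
print(f"max err over these elements: {max_err:.6f}")
```

The maximum error over these four elements is smaller than the 1430.686 in the log, since the log's figure covers the whole output matrix.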
I suspect the key factor is the ROCm version or some other part of the software environment. The same code behaves differently (in both TFLOPs and correctness) on different machines. The two machines I used came with different versions of ROCm (and other supporting tools) pre-installed.
On the same commit (develop branch, 578142db3a9e2bc273c2a178ca3e550ee79e6050, the latest commit at the time of writing), the ROCm-5.4.0 machine runs example_gemm_xdl_fp16 with correct output but reaches only ~110 TFLOPs, while the ROCm-5.5.0 machine reaches ~150 TFLOPs but produces incorrect output, as in the log above.
However, after I installed ROCm-5.4.0 on the ROCm-5.5.0 machine via amdgpu-install and built CK using:
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm-5.4.0 \
-D CMAKE_CXX_COMPILER=/opt/rocm-5.4.0/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx908;gfx90a" \
..
the behavior remained unchanged.
@QingtaoLi1 Could you try latest ROCm 5.6?
Yes, but I will have to wait until I regain access to the machine... :(
Have you observed similar issues when switching between ROCm versions or environments? My hardware is the same and the CK code is the same, so the software environment is probably the reason for this weird gap.
Update: @zjing14 I have confirmed the issue. When compiled with ROCm-5.4.0 (HIP=5.4.22454), 5.5.0 (HIP=5.5.23132), and 5.6.0 (HIP=5.6.23242) on the same machine, example_gemm_xdl_fp16 behaves differently. 5.4.0 and 5.6.0 give correct results at ~100 TFLOPs, while 5.5.0 gives wrong results at ~150 TFLOPs. ckProfiler behaves similarly: with 5.5.0 it reports incorrect output for every instance.
@QingtaoLi1 Thanks very much. Yes, ROCm 5.5 has some compiler issues.
@zjing14 I'm curious whether the issues stem from the ROCm side or the CK side?
@QingtaoLi1 Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!
@QingtaoLi1 Closing the ticket for now. Please feel free to re-open it if you still encounter the issue with the latest ROCm, and we will investigate further. Thanks!