rocFFT performance issue: rocFFT on MI200 series accelerators is slow compared to cuFFT on A100 GPUs

Actual Issue :

I tested the performance of rocFFT on MI200 series accelerators against cuFFT on A100 GPUs, using a single GPU in each case.

What is the expected behavior

rocFFT on MI200 series accelerators should be nearly twice as fast as cuFFT on A100.

What actually happens

We found that rocFFT is nearly 3x slower than cuFFT when rocFFT is benchmarked on MI200 series accelerators (such as MI210, and MI250X on Frontier) and cuFFT on an A100 GPU.
Specifically, cuFFT on the A100 takes only ~11-12 ms for R2C and C2R in double precision on a 512^3 grid, while rocFFT on the MI250X takes ~25-28 ms for the same R2C and C2R FFTs in double precision with ROCm 5.7.1.
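As a quick sanity check on the numbers above, the slowdown implied by the quoted timing ranges can be worked out directly. This is only arithmetic on the figures in this report; the variable and function names are made up for illustration.

```python
# Timing ranges quoted in the report, in milliseconds (512^3, double precision, R2C + C2R).
cufft_a100_ms = (11.0, 12.0)
rocfft_mi250x_ms = (25.0, 28.0)

def slowdown_range(slow, fast):
    """Return the (smallest, largest) ratio slow/fast implied by two (min, max) ranges."""
    lo = slow[0] / fast[1]  # best case for the slower library
    hi = slow[1] / fast[0]  # worst case for the slower library
    return lo, hi

lo, hi = slowdown_range(rocfft_mi250x_ms, cufft_a100_ms)
print(f"rocFFT / cuFFT slowdown: {lo:.2f}x to {hi:.2f}x")
# prints: rocFFT / cuFFT slowdown: 2.08x to 2.55x
```

So the quoted ranges correspond to a slowdown of roughly 2.1x to 2.5x.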

How to reproduce

I am attaching the simple code that I used to benchmark these FFTs. When compiled with hipcc for AMD GPUs it uses rocFFT (through hipFFT), and when compiled with nvcc it uses cuFFT. Attachment: FFT_testing.zip

Environment

| Item | Value |
|---|---|
| GPUs | MI250X and A100 (AMD timings taken on Frontier) |
| ROCm and CUDA | 5.7.1 and 12.1 respectively |
| Library | rocFFT |

Mar 05 '24 23:03 manver-iitk

We will check and confirm the numbers and then get back to you.

Mar 12 '24 20:03 feizheng10

Hello, sorry for the delayed response. Let's look at the system spec comparison first:

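| | A100 | MI250X |
|---|---|---|
| HW | | |
| Boost Clock (GHz) | 1.4 | 1.7 |
| Compute Cores | 6912 | 13312 |
| Memory | | |
| Type | HBM2e | HBM2e |
| Size (GB) | 80 | 128 |
| Bus Width (bits) | 5120 | 8192 |
| Bandwidth (TB/s) | 2.039 | 3.277 |
| SW | | |
| SDK | CUDA 12.1 | ROCm 6.1 |
| FFT Lib | cuFFT 1.2 | rocFFT 1.0.26 RC |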

Note: the MI250X has 2 GCDs. For small cases that fit into a single GCD, we should always compare the A100 to a single GCD of the MI250X.

Since FFT is mainly a memory-bound workload, comparing 3.277/2 ≈ 1.64 TB/s for one GCD to 2.039 TB/s for the A100, we should expect FFT to be slightly slower on a single MI200 GCD than on an A100.
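The bandwidth argument above can be spelled out as a rough calculation. This is a minimal sketch that assumes an idealized memory-bound model in which runtime is inversely proportional to peak memory bandwidth; real kernels will not hit peak bandwidth, and the constant names are illustrative only.

```python
# Peak memory bandwidth figures from the spec comparison, in TB/s.
A100_BW_TBS = 2.039
MI250X_BW_TBS = 3.277   # whole package, i.e. both GCDs together
GCDS_PER_MI250X = 2

# Bandwidth available to a job that runs on a single GCD.
per_gcd_bw = MI250X_BW_TBS / GCDS_PER_MI250X

# Assumption: time ~ bytes moved / bandwidth, so the expected runtime
# ratio is simply the inverse of the bandwidth ratio.
bw_ratio = per_gcd_bw / A100_BW_TBS
expected_slowdown = 1.0 / bw_ratio

print(f"Single-GCD bandwidth: {per_gcd_bw:.4f} TB/s")
print(f"GCD / A100 bandwidth ratio: {bw_ratio:.2f}")
print(f"Expected slowdown on one GCD: {expected_slowdown:.2f}x")
```

Under that assumption a single GCD has about 80% of the A100's bandwidth, so a purely memory-bound FFT would be expected to run around 1.24x slower, not faster.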

For a representative case (256x256x256, R2C, double precision, in-place), we reduced the time on MI250X from 1149.36 us to 877.92 us with https://github.com/ROCm/rocFFT/commit/35da4ed035ea1ff3f9eddebe65de5e177e8f8838
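For reference, the size of that improvement works out as follows. This is just arithmetic on the two timings quoted above; the names are illustrative.

```python
# Timings for the representative R2C double-precision in-place case, in microseconds.
before_us = 1149.36
after_us = 877.92

speedup = before_us / after_us                    # how many times faster
reduction = (before_us - after_us) / before_us    # fraction of runtime removed

print(f"Speedup: {speedup:.2f}x")
print(f"Runtime reduction: {reduction * 100:.1f}%")
# prints:
# Speedup: 1.31x
# Runtime reduction: 23.6%
```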

There may be a little more room for improvement, but that is still under investigation.

We will keep you updated on optimizations for other cases, and we hope this ticket can be closed as it is.

Apr 29 '24 21:04 feizheng10