Performance issue: rocFFT on MI200-series accelerators is slow compared to cuFFT on A100 GPUs
Actual issue:
I tested the performance of rocFFT on MI200-series accelerators against cuFFT on A100 GPUs, using a single GPU in each case.
What is the expected behavior
- rocFFT on MI200-series accelerators should be nearly twice as fast as cuFFT on an A100.
What actually happens
- We found that rocFFT on MI200-series accelerators (such as the MI210, and the MI250X on Frontier) is actually nearly 3x slower than cuFFT on an A100 GPU.
- Specifically, cuFFT on an A100 takes only ~11-12 ms for R2C and C2R in double precision on a 512^3 grid, while rocFFT on an MI250X with ROCm 5.7.1 takes ~25-28 ms for the same double-precision R2C and C2R FFTs.
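As a quick sanity check on the figures above, the slowdown implied by the reported timing ranges can be computed directly. This is only a sketch: the variable and function names are made up for illustration, and the ranges are the approximate values quoted in this issue, not new measurements.

```python
# Approximate timings quoted in this issue (milliseconds),
# 512^3 grid, double precision, R2C and C2R.
cufft_a100_ms = (11.0, 12.0)      # (fastest, slowest) reported for cuFFT on A100
rocfft_mi250x_ms = (25.0, 28.0)   # (fastest, slowest) reported for rocFFT on MI250X


def slowdown_range(slow, fast):
    """Return the (smallest, largest) slow/fast time ratio for two (min, max) ranges."""
    smallest = slow[0] / fast[1]  # best case for the slow library, worst for the fast one
    largest = slow[1] / fast[0]   # worst case for the slow library, best for the fast one
    return smallest, largest


low, high = slowdown_range(rocfft_mi250x_ms, cufft_a100_ms)
print(f"rocFFT/cuFFT time ratio: {low:.2f}x to {high:.2f}x")
# prints "rocFFT/cuFFT time ratio: 2.08x to 2.55x"
```

With these particular ranges the ratio comes out between roughly 2x and 2.5x.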
How to reproduce
- I am enclosing the simple code that I used to benchmark these FFTs. As you will see in the code, compiling with hipcc for AMD GPUs uses rocFFT (via hipFFT), and compiling with nvcc uses cuFFT. FFT_testing.zip
Environment
| Item | Value |
|---|---|
| GPUs | MI250X and A100 |
| AMD timing system | Frontier |
| ROCm and CUDA | 5.7.1 and 12.1 respectively |
| Library | rocFFT |
We will check and confirm the numbers and then get back to you.
Hello,
Sorry for the delayed response.
Let's look at a comparison of the system specs first:
| | A100 | MI250X |
|---|---|---|
| HW | | |
| Boost Clock (GHz) | 1.4 | 1.7 |
| Compute Core | 6912 | 13312 |
| Memory | | |
| Type | HBM2e | HBM2e |
| Size (GB) | 80 | 128 |
| Bus Width (Bit) | 5120 | 8192 |
| Bandwidth (TB/s) | 2.039 | 3.277 |
| SW | | |
| SDK | CUDA 12.1 | ROCm 6.1 |
| FFT Lib | cuFFT 1.2 | rocFFT 1.0.26 RC |
Note: The MI250X has 2 GCDs. For small cases that fit into a single GCD, we should always compare the A100 to a single GCD of the MI250X.
Since FFT is mainly a memory-bound workload, comparing 3.277/2 ≈ 1.64 to 2.039, we should expect FFT to be slightly slower on a single MI200 GCD than on an A100.
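The bandwidth argument above can be written out as a short calculation. This is a rough sketch that assumes run time scales inversely with peak memory bandwidth (a purely memory-bound model) and that the MI250X bandwidth splits evenly across its two GCDs. The names are illustrative only.

```python
# Peak memory bandwidth values as listed in the spec table.
# Both use the same unit, so the unit cancels in the ratio.
a100_bw = 2.039
mi250x_bw = 3.277
gcds_per_mi250x = 2  # the MI250X exposes two GCDs


def per_gcd_bandwidth(total_bw, gcds):
    """Assume the package bandwidth is shared equally between GCDs."""
    return total_bw / gcds


gcd_bw = per_gcd_bandwidth(mi250x_bw, gcds_per_mi250x)

# Under a purely memory-bound model, time is proportional to 1 / bandwidth,
# so the expected slowdown of one GCD relative to an A100 is a bandwidth ratio.
expected_slowdown = a100_bw / gcd_bw

print(f"per-GCD bandwidth: {gcd_bw:.4f}")
print(f"expected single-GCD slowdown vs A100: {expected_slowdown:.2f}x")
# prints "expected single-GCD slowdown vs A100: 1.24x"
```

Under these assumptions a single GCD would be expected to run roughly 1.2x slower than an A100. Real kernels rarely reach peak bandwidth, so this is an order-of-magnitude guide rather than a prediction.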
With a representative case (256x256x256, R2C, double precision, in-place), we reduced the run time on the MI250X from 1149.36 us to 877.92 us with https://github.com/ROCm/rocFFT/commit/35da4ed035ea1ff3f9eddebe65de5e177e8f8838
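For reference, the improvement from that commit can be expressed as a speedup factor and a percentage reduction. The sketch below uses only the two timings quoted above, and the variable names are illustrative.

```python
# Timings quoted above for the representative R2C double-precision in-place case (microseconds).
before_us = 1149.36
after_us = 877.92

speedup = before_us / after_us          # how many times faster the new kernel is
reduction = 1.0 - after_us / before_us  # fraction of the original run time removed

print(f"speedup: {speedup:.2f}x")
print(f"run time reduction: {reduction * 100:.1f}%")
# prints "speedup: 1.31x" and "run time reduction: 23.6%"
```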
There may be a little more room for improvement, but that is still under investigation.
We will keep you updated on optimizations for other cases, and hope we can close this ticket as it is.