VkFFT
Vulkan_FFT crashes the GPU
Dear Dmitrij,
Thank you for developing an interesting application. It looks very promising. I built it from sources. When I ran Vulkan_FFT, it showed rather small computation times ... until it crashed the card around test 17. The monitor went black and I had to reboot the host. Here are the messages in the kernel log:
May 9 17:43:45 xxxxxxxx kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered
May 9 17:43:48 xxxxxxxx kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=1265969, emitted seq=1265970
May 9 17:43:48 xxxxxxxx kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Vulkan_FFT pid 1867631 thread Vulkan_FFT pid 1867631
May 9 17:43:48 xxxxxxxx kernel: amdgpu 0000:21:00.0: amdgpu: GPU reset begin!
May 9 17:43:48 xxxxxxxx kernel: amdgpu 0000:21:00.0: amdgpu: BACO reset
May 9 17:43:48 xxxxxxxx kernel: amdgpu 0000:21:00.0: amdgpu: GPU reset succeeded, trying to resume
May 9 17:43:48 xxxxxxxx kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
May 9 17:43:48 xxxxxxxx kernel: [drm] VRAM is lost due to GPU reset!
Then the OS tries to restart the driver, but fails.
/dist/VkFFT-master/build/Vulkan_FFT -h
VkFFT v1.2.1 (26-04-2021). Author: Tolmachev Dmitrii
Vulkan backend
-h: print help
Here is some information from clinfo:
Device Name gfx804
Device Board Name (AMD) Radeon RX550/550 Series
Global memory size 4080807936 (3.801GiB)
I ran gputest_gui.py giu. It passed. I ran an FFT using AMD rocFFT via hipfort and it worked, though probably not as fast as your code (a single-precision 2D FFT of 8192x8192 took 0.56 s, including copying data to and from the GPU).
Does this ring any bell for you? Any other information that might help to debug?
Sincerely, Leonid 2021.05.09_19:55:30
Hello, Sorry that you have encountered a system crash. In order to understand the source, I will need to know the exact FFT system it occurred on. It would be nice if you could pinpoint it with the -benchmark_vkfft option, which runs the test on a specific system (see its syntax in the -h output). Also, have you tried the HIP and OpenCL backends of VkFFT on your GPU? Does the error occur with them? Best regards, Dmitrii
Dmitrij,
It crashed when it is called with option -vkfft 0
Also, have you tried HIP and OpenCL backends of VkFFT on your GPU?
How do I build VkFFT with the HIP and OpenCL backends? Is there detailed documentation on how to install VkFFT? By the way, what do the phrases "HIP backend" and "OpenCL backend" mean? Is there a document that explains them?
Thank you in advance, Leonid 2021.05.15_10:29:07
I was talking about the exact FFT system that it crashed on. -vkfft 0 tests many of them and determining which one is crashing can be very helpful.
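One way to narrow this down is to run one FFT system per process and record each size on disk before launching it, so that after a crash the last "starting" line names the culprit. The following is only a minimal sketch: the binary path, the log file name and the list of sizes are hypothetical, and the flag set is copied from the -benchmark_vkfft invocation quoted later in this thread rather than taken from the -h output.

```python
import os
import subprocess

# Assumption: path to the benchmark binary; adjust to your build directory.
BINARY = "./Vulkan_FFT"


def build_command(binary, x, y):
    """Command line for one 2D system, reusing the flags quoted in this thread."""
    return [binary, "-benchmark_vkfft", "-X", str(x), "-Y", str(y),
            "-P", "0", "-B", "1", "-N", "1", "-R2C", "0"]


def sweep(sizes, log_path, run=subprocess.run):
    """Run each size in its own process, logging it to disk before the launch."""
    results = []
    with open(log_path, "a") as log:
        for n in sizes:
            log.write(f"starting {n}x{n}\n")
            log.flush()
            os.fsync(log.fileno())  # force the line to disk before a possible GPU hang
            proc = run(build_command(BINARY, n, n))
            results.append((n, proc.returncode))
            log.write(f"finished {n}x{n} rc={proc.returncode}\n")
            log.flush()
            os.fsync(log.fileno())
    return results


# Hypothetical usage: powers of two from 16 to 8192.
# sweep([2 ** k for k in range(4, 14)], "vkfft_sweep.log")
```

If the machine resets, the log should end with a "starting" line that has no matching "finished" line.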
Regarding HIP and OpenCL backend there is a variable in CMakeLists called VKFFT_BACKEND which is responsible for selecting one. Information on how to install VkFFT can be found on the main page of the VkFFT repository on GitHub in the Installation section.
HIP backend means that VkFFT will generate HIP syntax kernels and launch them under the ROCm ecosystem. With the OpenCL backend, kernels will be generated in OpenCL syntax and launched as a part of the OpenCL ecosystem.
Best regards, Dmitrii
Dear Dmitri,
I had to add amdgpu.gpu_recovery=1 amdgpu.lockup_timeout=3000 to the kernel command line in order to cope with GPU crashes. When I boot with these options, a GPU crash stops X11, but I am still able to log in to the console (via Ctrl/Alt/F2) and run killall -9 Xorg, after which Xorg restarts automatically. This allows me to test VkFFT without rebooting.
It took a fair amount of time to guess how to build VkFFT with HIP. I did not realize that CMake does not accept options and that one needs to edit CMakeLists.txt in order to configure VkFFT.
svn co https://github.com/DTolm/VkFFT
cd VkFFT/tags/v1.2.2
patch -Np0 -i /patches/VkFFT_20210522.patch
mkdir build
cd build
cmake ..
sed -i -e "s@-LHIP_CLANG_INCLUDE_PATH-NOTFOUND/../lib/linux -lclang_rt.builtins-x86_64@ @g" CMakeFiles/Vulkan_FFT.dir/link.txt
make -j 64
I cannot run any tests with vkfft:
time /dist/VkFFT-1.2.2/tags/v1.2.2/build/Vulkan_FFT -benchmark_vkfft -X 8192 -Y 8192 -P 0 -B 1 -N 10 -R2C 0
it stops with message
0 - VkFFT FFT + iFFT C2C benchmark 1D batched in single precision
hiprtcCompileProgram error: HIPRTC_ERROR_COMPILATION
I have file /opt/rocm/hip/bin/include/__clang_hip_runtime_wrapper.h in my system, but the bench does not find it. Possibly it is related to an error in CMakeFiles/Vulkan_FFT.dir/link.txt . I was unable to guess how to fix it.
I was able to run Vulkan_FFT -rocfft 0; Vulkan_FFT -rocfft 3; Vulkan_FFT -rocfft 6;
/dist/VkFFT-1.2.2/tags/v1.2.2/build/Vulkan_FFT -benchmark_rocfft -X 16000 -Y 16000 -Z 1 -P 0 -B 1 -N 1 -R2C 0
runs, but when I increase both -X and -Y to 16001, it crashes with message
Memory access fault by GPU node-1 (Agent handle: 0x1853770) on address 0xfa040000. Reason: Page not present or supervisor privilege. Abort (core dumped)
runs, but when I use both -X and -Y 16384 or greater, it crashes the GPU. :-(
The reason is not clear to me. A complex matrix of 16384x16384 is 2 GB. Is there 32-bit code somewhere? My GPU has 4 GB:
Global memory size 4165656576 (3.88GiB)
Global free memory (AMD) 4042392 (3.855GiB)
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Leonid 2021.05.22_11:33:46 VkFFT_20210522_patch.txt
Possibly related:
On my 2017 MacBook Pro, running pyvkfft/examples/benchmark.py crashes the kernel when using the Intel Graphics GPU after dim 16 x 270 x 270 (using the Radeon 555 Pro GPU works fine):
Selected OpenCL device: Intel(R) HD Graphics 630 [Apple]
Gbytes/s and time given for a couple (FFT, iFFT), dtype=complex64
16 x N x N [2D] vkFFT.opencl gpyfft[clFFT]
/Users/yves/miniconda3/lib/python3.8/site-packages/pyopencl-2021.1.1-py3.8-macosx-10.9-x86_64.egg/pyopencl/__init__.py:266: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
warn("Non-empty compiler output encountered. Set the "
16 x 16 x 16 2.26 [108.20 µs] 2.64 [ 92.64 µs] [nb=1000]
16 x 18 x 18 3.39 [ 91.26 µs] 3.22 [ 96.00 µs] [nb=1000]
16 x 20 x 20 4.16 [ 91.61 µs] 3.90 [ 97.75 µs] [nb=1000]
16 x 21 x 21 2.29 [183.60 µs] 4.48 [ 93.86 µs] [nb=1000]
16 x 24 x 24 5.17 [106.20 µs] 5.34 [102.90 µs] [nb=1000]
16 x 25 x 25 7.27 [ 81.97 µs] 4.95 [120.44 µs] [nb=1000]
... blah blah blah
16 x 256 x 256 17.78 [ 3.52 ms] 13.20 [ 4.73 ms] [nb= 160]
16 x 270 x 270 11.59 [ 6.00 ms] 8.48 [ 8.20 ms] [nb= 144] --> crash after that one
Message:
Abort trap: 6
I think it is more a Vulkan_FFT bug than a pyvkfft one.
Sorry, it's crashing on the gpyfft side. Disabling it allows the test to run OK till the end.
@lpetrov00 There is something wrong with the driver and HIP installation; it may be best to ask about this on the ROCm repo. Still, as I understand, if the Vulkan version worked up to some point, at what sequence did it fail exactly?
As for the rocFFT code not working, I have an explanation. All GPU FFT libraries use an additional buffer of system size if they switch to multi-upload FFT - this happens at the 4096-8192 sequence range. Due to this fact, your FFTs after 16384x16384 can crash. As for the 16001 sequence crashing - I don't know what rocFFT is doing there, as this is a different library.
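The arithmetic behind this can be checked with a rough estimate. The sketch below assumes single-precision complex data (8 bytes per element) and exactly one extra buffer of system size, and it ignores every other allocation made by the driver or the library, so it is a lower bound rather than a measurement. The function name is hypothetical; the memory figure is the "Global memory size" value reported above.

```python
BYTES_PER_COMPLEX64 = 8  # two 32-bit floats per element


def fft_footprint_bytes(nx, ny, extra_buffers=1):
    """Bytes for an nx-by-ny complex64 array plus same-sized scratch buffers."""
    data = nx * ny * BYTES_PER_COMPLEX64
    return data * (1 + extra_buffers)


# "Global memory size" from the clinfo output quoted in this thread.
GLOBAL_MEMORY = 4165656576

for n in (8192, 16000, 16384):
    need = fft_footprint_bytes(n, n)
    verdict = "fits" if need <= GLOBAL_MEMORY else "does not fit"
    print(f"{n}x{n}: {need} bytes, {verdict}")
```

Under these assumptions a 16384x16384 transform needs 2 GiB for the data and another 2 GiB for the extra buffer, which exceeds the 4165656576 bytes reported for the card, while 16000x16000 stays just below it.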
@yves-surrel Glad that the problem was resolved.