ITK
ITK copied to clipboard
PERF: Leverage speedup over requested region subset in `FFTConvolutionImageFilter`
Description
A performance discrepancy is observed when blurring with FFTDiscreteGaussianImageFilter
in the frequency domain via GPU-based FFTs vs DiscreteGaussianImageFilter
in physical space via convolution with a separable kernel on CPU.
Impact analysis
In subplots below "Vk" represents the GPU implementation and "Vnl" represents the CPU implementation.
In the top row 3D cube images of variable "LargestPossibleRegionSize" are blurred. Filter performance follows a natural, smooth trend as expected.
In the bottom row 3D cube images of fixed "LargestPossibleRegion" size with variable "RequestedOutputRegion" sizes are blurred. GPU FFT convolution plateaus while CPU FFT shows a sharp performance increase over comparable input size in the top row. The result is that GPU FFT convolution offers minimal to no performance boost over CPU implementation for most image region subsets.
Single- vs multi-threaded results included to show that threading does not fully account for difference in performance curves.
It is hypothesized that the performance difference is based on boundary handling in the respective filters. FFTConvolutionImageFilter::GenerateInputRequestedRegion always requests the largest possible input region, so every call processes the same region size irrespective of the requested output. This leads to the plateau observed in the lower row. By contrast, DiscreteGaussianImageFilter
propagates the requested output region so that a subset of the input is actually processed. When the requested region is centered inside the input image, it is possible for boundary pixels to be obtained directly from the surrounding area in the input image rather than generated dynamically from the boundary condition. There is some evidence that the difference in boundary processing is a core issue here, setting the requested region index at (0,0,0) results in lessened CPU blurring performance.
Expected behavior
GPU and CPU blurring performance should follow a similar trend when OutputRequestedRegion
is reduced from LargestPossibleRegion
.
Actual behavior
CPU blurring performance exhibits significant speedup when processing image subsets while GPU blurring performance is independent of the requested region.
Versions
ITK v5.3rc04
Environment
Windows, CMake 3.22
Additional Information
Will investigate and benchmark adjustments to how FFTConvolutionImageFilter sets its requested input region size.
Related to speedup with https://github.com/InsightSoftwareConsortium/ITKVkFFTBackend
Investigation shows that underlying FFT classes also request the largest possible region in each FFT direction, which limits our ability to optimize for image subregions. Exploring the impact of propagating only the requested region for ITK pipeline optimization.
@thewtex Please let me know or update here if you are pursuing a similar investigation so that we can avoid duplicate effort.
Additional notes:
-
itk::FFTConvolutionImageFilter
usesitk::RealToHalfHermitianForwardFFTImageFilter
anditk::HalfHermitianToRealInverseFFTImageFilter
, so it makes sense to focus on those first - forward FFT enlarges the requested output region to the largest possible region by default. While this may maximize description of the whole image, it is not performant in a pipeline with varying requested image regions.
- As previously noted, forward FFT enlarges the input requested region to the largest possible region.
FFT enlarges the input requested region
This might not be due to performance improvement. It might be how the filter works. I think it needs the entire domain as it is equivalent to convolution with an infinite window size.
This might not be due to performance improvement. It might be how the filter works. I think it needs the entire domain as it is equivalent to convolution with an infinite window size.
Right, but at the same time it is reasonable to request only a subregion of the original image and act as if that is the "entire image" to process.
For instance, we want roughly the same result from applying forward FFT to the following two cases:
- A 3x3 image of all ones, operating over the largest possible region;
- A 9x9 image of zeros with a 3x3 grid of ones in the center, operating over the 3x3 grid of ones (index [3,3], size [3,3])
In both cases the FFT filter should "see" only the 3x3 grid of ones for its operations.
Does the non-FFT version ignore parts of the image outside of the requested region, or does it access the kernel-neighborhood-size worth of pixels outside of the requested region? If so, the FFT version's kernel size is the entire image, so they are equivalent in that regard.
FFT filter should "see" only the 3x3 grid
If you want that, you can first use region of interest filter and then apply FFT convolution. But I don't think the result will be equivalent.
FFT enlarges the input requested region to the largest possible region.
I might misunderstand the problem but I believe this is required: you need the whole input image to compute a piece of the FFT.
I see both of your points, this may be a snag in potential application to multiresolution speedup. Will experiment a bit with the region of interest filter to see how close we can get with cropping.
I have added a proof of concept notebook at https://github.com/InsightSoftwareConsortium/ITKVkFFTBackend/pull/47 demonstrating how we can use subregions in FFTs for convolution speedup. @dzenanz @SimonRit would you please take a look and let me know if you believe that procedure is sound?
The notebook does not explicitly rely on VkFFT speedup (I used Vnl backends for generating output), could be moved into ITKSphinxExamples in the future.
As a quick update, I am now actively working to apply the approved procedure in the notebook I linked above to itk::FFTConvolutionImageFilter
to leverage the ITK pipeline for speedup over image subregions. The filter currently assumes in many places that the largest possible region is being used, so effort is nontrivial but will benefit us in allowing multi resolution pyramid generation to be sped up with accelerated FFTs.
Starting to see improved results where FFT convolution time decreases as subregion size decreases. However, multithreaded spatial convolution still narrowly beats out FFT convolution for centered subregions. Will need to investigate further internal pipeline optimizations.