VkFFT
C2C Convolution with Multiple Uploads
I'm attempting to perform a 1D C2C convolution (in double precision). The basic setup is a buffer containing 4,096 elements convolved with a kernel also containing 4,096 elements (zero-padded manually).
With the size set to 2048, the convolution completes and the output is as expected. When the FFT size is increased to 4096, however, the results are incorrect. The difference between the two sizes appears to be that 4096 elements require the transform to be broken into multiple uploads (based on the output of printMemoryLayout).
Device: NVIDIA Tesla T4
The reduced example code below shows the relevant VkFFT setup. The shader code generated for both N=2048 and N=4096 is also attached:
shader_2048.txt shader_4096.txt
// SETUP
VkGPU* vkGpu = new VkGPU();
vkGpu->device_id = 0;
createInstance(vkGpu, 0);
setupDebugMessenger(vkGpu);
findPhysicalDevice(vkGpu);
createDevice(vkGpu, 0);
createFence(vkGpu);
createCommandPool(vkGpu);
vkGetPhysicalDeviceProperties(vkGpu->physicalDevice,
&vkGpu->physicalDeviceProperties);
vkGetPhysicalDeviceMemoryProperties(vkGpu->physicalDevice,
&vkGpu->physicalDeviceMemoryProperties);
glslang_initialize_process();
// CONFIGURATION
uint64_t kernelBufferSize = sizeof(double) * (2 * N);
VkFFTConfiguration* conf = new VkFFTConfiguration();
conf->FFTdim = 1;
conf->size[0] = N;
conf->size[1] = 1;
conf->size[2] = 1;
conf->kernelConvolution = true;
conf->coordinateFeatures = 1;
conf->normalize = 1;
conf->device = &vkGpu->device;
conf->queue = &vkGpu->queue;
conf->fence = &vkGpu->fence;
conf->commandPool = &vkGpu->commandPool;
conf->physicalDevice = &vkGpu->physicalDevice;
conf->isCompilerInitialized = true;
conf->doublePrecision = true;
conf->printMemoryLayout = true;
VkBuffer* kernel = new VkBuffer();
VkDeviceMemory* kernelDeviceMemory = new VkDeviceMemory();
allocateBuffer(vkGpu, kernel, kernelDeviceMemory,
VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT |
VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, kernelBufferSize);
conf->buffer = kernel;
conf->bufferSize = &kernelBufferSize;
VkFFTApplication* appKernel = new VkFFTApplication();
transferDataFromCPU(vkGpu, data, kernel, kernelBufferSize);
initializeVkFFT(appKernel, *conf);
VkFFTConfiguration* convConf = conf; // reuse the kernel configuration for the convolution stage
convConf->kernelConvolution = false;
convConf->conjugateConvolution = 0;
convConf->performConvolution = true;
convConf->coordinateFeatures = 1;
convConf->kernel = conf->buffer;
VkBuffer* buffer = new VkBuffer();
VkDeviceMemory* bufferDeviceMemory = new VkDeviceMemory();
allocateBuffer(vkGpu, buffer, bufferDeviceMemory,
VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT |
VK_BUFFER_USAGE_TRANSFER_DST_BIT, VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, kernelBufferSize);
convConf->buffer = buffer;
convConf->bufferSize = &kernelBufferSize;
convConf->kernelSize = &kernelBufferSize;
convConf->numberBatches = N / 8;
VkFFTApplication* appConv = new VkFFTApplication();
initializeVkFFT(appConv, *convConf);
transferDataFromCPU(_vkGpu, input, buffer, kernelBufferSize);
// PERFORM CALCULATION
VkFFTLaunchParams launchParams = {};
performVulkanFFT(vkGpu, appConv, &launchParams, -1, 1);
// GET OUTPUT
double result[2*N];
transferDataToCPU(vkGpu, result, buffer, kernelBufferSize);
// TEAR DOWN
deleteVkFFT(appKernel);
deleteVkFFT(appConv);
delete conf;
vkDestroyFence(vkGpu->device, vkGpu->fence, NULL);
vkDestroyCommandPool(vkGpu->device, vkGpu->commandPool, NULL);
vkDestroyDevice(vkGpu->device, NULL);
DestroyDebugUtilsMessengerEXT(vkGpu, NULL);
vkDestroyInstance(vkGpu->instance, NULL);
glslang_finalize_process();
delete vkGpu;
Hello,
I have checked the results of the Vulkan (2 uploads) and CUDA (1 upload, as the driver exposes more shared memory) backends for the 4096 system and they match. So I will need more information about your configuration:
- The code you posted uses batching. As of now, VkFFT doesn't support convolutions with multiple inputs and a single kernel, so your code will access out-of-bounds memory for the kernel. If this use case is your target, I can add it later.
- Can you try the code with 1 kernel - 1 system (convConf->numberBatches = 1)? And please send the input/output contents of the FFT so I can test your particular inputs.
Best regards, Dmitrii
Dmitrii,
Thank you for your quick response.
I tried switching the number of batches from N/8 to 1 and actually see no difference in my results. Regardless, I've left the value at 1 and run the program again to provide you with my inputs and outputs (attached).
input.csv - the data contained in the buffer
kernel.csv - the data contained in the kernel
vkfft_output.csv - the output I received
matlab_output.csv - the output I would have expected from running the same set of operations through MATLAB (same input and kernel data)
Thanks for taking a look at this.
Ah, I see the issue now.
VkFFT stores the kernel in a non-original data layout in the multiple-uploads case. This is done because an FFT performed in multiple uploads normally needs to transpose the intermediate matrix representation as its last step.
However, in the case of VkFFT convolutions - if both the kernel and the FFT are done with VkFFT - we can multiply them element-wise in the frequency domain without the transposition, and then simply invert the ordering during the inverse FFT. This gives 2x memory savings (as this transposition is not done in place) and saves memory transfers (as we can merge the last step of the FFT with the kernel multiplication).
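In other words, the element-wise product in the frequency domain commutes with any fixed permutation $P$ of the data:

$$P(\mathcal{F}x) \odot P(\mathcal{F}k) = P(\mathcal{F}x \odot \mathcal{F}k),$$

so as long as the kernel is stored in the same permuted order that the forward multi-upload FFT produces, the product is correct, and the inverse FFT can consume that order directly.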
The solution is either to do the kernel initialization with VkFFT from the beginning, or to transpose your kernel matrix according to VkFFT's inner layout [for a 4096 sequence it is represented as a 64x64 2D matrix, so the transpose is i -> (i % 64) * 64 + i / 64]. I tested both versions and they produced the correct results.
You can get the inner matrix dimensions from the app->localFFTPlan->axisSplit array. The values there are stored in increasing stride order, so i -> (i % axisSplit[1]) * axisSplit[0] + i / axisSplit[1]. For a triple upload this is a bit more tricky, as the array becomes 3D, but the same logic applies.
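For the first option, the forward FFT of the kernel has to be performed by the appKernel application (the one initialized with kernelConvolution = true) before the convolution application is used - roughly, using the names from your snippet:

// Forward FFT of the kernel via the kernelConvolution application.
// This leaves the transformed kernel in VkFFT's internal (permuted)
// multi-upload layout, which is what the convolution plan expects.
VkFFTLaunchParams kernelLaunch = {};
performVulkanFFT(vkGpu, appKernel, &kernelLaunch, -1, 1);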
Best regards, Dmitrii
Dmitrii,
Thank you, that makes sense. I was able to get this working as expected by doing the kernel initialization with VkFFT from the beginning and following that path through.
For my use case, I'd also like to implement the transposition approach you mentioned. I understand how to transpose the data using the logic you provided, but I'm struggling to find the right configuration options to get it working. Currently I'm treating the data as two 1D vectors, so I assume that when I switch to treating the data as a 2D matrix, I need to modify some of the configuration (perhaps dimensions/size or coordinateFeatures/matrixConvolution values). I was hoping you could give me some guidance on what needs to change in my configuration to make the transpose method work.
I need to modify some of the configuration (perhaps dimensions/size or coordinateFeatures/matrixConvolution values)
No, there are no changes that need to be made to the configuration. You initialize VkFFT just as you did before and then get app->localFFTPlan->axisSplit from the convolution application. Then you reorder the data before submitting the dispatch call. The data reordering just views the data as a contiguous array of complex numbers and performs the i -> (i % axisSplit[1]) * axisSplit[0] + i / axisSplit[1] operation. The 2D matrix transposition is just a human explanation of what is happening (or, in the case of convolutions, not happening) inside VkFFT.
Certainly something must be different from the original configuration, because with the current setup, axisSplit[1] has a value of 0.
Do you access it after the initializeVkFFT call? It will only be initialized after that call.
Yes, I access it after the initializeVkFFT call and before the dispatch. axisSplit[0] has a value of 64, but axisSplit[1] is still 0.
When I perform the transpose using values of 64 as you indicated above, the FFT does yield the correct output, so the code seems to be working as expected; it's just that the value of axisSplit[1] is not what I expected.
Dmitrii,
I was hoping you could clear up an issue I'm having at this point. I have chosen to take the transpose approach for a complex input containing 4096 elements. I believe my transpose is correct, but the result is not quite what I'd expect.
Using the same set of data, I get different results between the transpose approach and the approach of setting up the kernel data using VkFFT (taking the IFFT of the kernel data in MATLAB, then using VkFFT to do the FFT without transposition). For reference, the section of code that does my transposition is included here:
std::vector<std::complex<double>> kernelInput;
// Code to fill kernelInput goes here
int32_t dim = 64;
std::vector<std::complex<double>> transposedKernel;
transposedKernel.resize(N);
for (int32_t i = 0; i < N; i++) {
    int32_t idx = (i % dim) * dim + (i / dim);
    transposedKernel[idx] = kernelInput[i];
}
Based on MATLAB calculations, I believe the transpose approach is not giving me the correct output. I did the transposition assuming that axisSplit[1] (which has a value of 0) should be the same as axisSplit[0] (which has a value of 64). I'm using the same setup as described in my initial post, with N = 4096.
I've attached the input data, the kernel data, the result of my transpose of the kernel data, the result from VkFFT, and the expected result from MATLAB. I've also included the generated shader code. If you could take a look at this and tell me what I'm doing wrong, I'd greatly appreciate it.
Dear @jgeary18,
I am truly sorry - I previously described axisSplit incorrectly. It is actually a 2D array, with the first dimension related to the x, y, z axes and the second to the uploads. This is why axisSplit[1] is 0 in your case - you should access it as axisSplit[0][1].
I have added a transposition implementation to the FP128 initialization of kernels - you can take the code from there: lines 31696-31701 in the v1.2.31 version.
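As a rough sketch with the corrected indexing (assuming a two-upload 1D plan, with kernelInput holding the N complex kernel values as in your snippet):

// Reorder the kernel into VkFFT's internal two-upload layout using the
// split sizes reported by the initialized convolution application.
uint64_t split0 = appConv->localFFTPlan->axisSplit[0][0]; // 64 for N = 4096
uint64_t split1 = appConv->localFFTPlan->axisSplit[0][1]; // 64 for N = 4096
std::vector<std::complex<double>> transposedKernel(N);
for (uint64_t i = 0; i < N; i++)
    transposedKernel[(i % split1) * split0 + i / split1] = kernelInput[i];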
Hope this helps and feel free to ask other questions!
Best regards, Dmitrii