FBP filtering on GPU
Summary
Implemented FBP filtering on GPU using cuFFT.
See #312
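For reference, the step being moved to the GPU is the standard Fourier-domain ramp filtering of each projection row: zero-pad the row, FFT, multiply by the filter, inverse FFT, crop. A minimal CPU-side sketch in numpy (illustrative only; `filter_rows_cpu` is a hypothetical helper and the plain |f| ramp is a stand-in for the filter built in filtering.py — this PR implements the same operation with cuFFT):

```python
import numpy as np

def filter_rows_cpu(proj_2d, filt):
    """Ramp-filter each detector row of one projection in the Fourier domain."""
    n_rows, n_det = proj_2d.shape
    filt_len = filt.shape[0]                         # padded length, >= 2 * n_det
    pad = (filt_len - n_det) // 2
    padded = np.zeros((n_rows, filt_len), dtype=np.float32)
    padded[:, pad:pad + n_det] = proj_2d             # zero-pad around the detector data
    f = np.fft.fft(padded, axis=1)                   # forward FFT along the u axis
    f *= filt[np.newaxis, :]                         # multiply by the (real) filter
    out = np.real(np.fft.ifft(f, axis=1))            # back to the spatial domain
    return out[:, pad:pad + n_det]                   # crop the padding away

# Toy usage: one 256x256 projection and a |f| ramp as a stand-in filter.
proj = np.random.rand(256, 256).astype(np.float32)
filt = np.abs(np.fft.fftfreq(512)).astype(np.float32)
print(filter_rows_cpu(proj, filt).shape)             # (256, 256)
```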
Test
Condition
Software | Version |
---|---|
Windows | 10 64bit |
Python(anaconda) | 3.10.4 |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | GTX1060 6GB |
Software | Version |
---|---|
Windows | 10 64bit |
MATLAB | 2022a |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | GTX1060 6GB |
Python
import numpy as np
import time
import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt
listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))
gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)
mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag # size of each pixel (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)
nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)
# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))
# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter
tic = timer()
for _ in range(niter):
    fdkout = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=False)
toc = timer()
print("CPU: {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    ossart = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=True)
toc = timer()
print("GPU: {}".format(toc-tic))
# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(fdkout, head, ["nRMSE"]))
print("RMSE fdk GPU: ")
print(Measure_Quality(ossart, head, ["nRMSE"]))
# Plot
fig, axes = plt.subplots(3, 3)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(fdkout[geo.nVoxel[0] // 2])
axes[1, 0].imshow(fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(fdkout[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(ossart[geo.nVoxel[0] // 2])
axes[1, 1].imshow(ossart[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(ossart[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("GPU-CPU")
axes[0, 2].imshow(ossart[geo.nVoxel[0] // 2] -fdkout[geo.nVoxel[0] // 2])
axes[1, 2].imshow(ossart[:, geo.nVoxel[1] // 2, :]-fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(ossart[:, :, geo.nVoxel[2] // 2]-fdkout[:, :, geo.nVoxel[2] // 2])
plt.show()
# tigre.plotProj(proj)
# tigre.plotImg(fdkout)
Matlab
clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256]);
%% Load data and generate projections
% define angles
angles=linspace(0,2*pi,100);
% Load head phantom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated');
sprintf("size of projections = %s", mat2str(size(projections)))
niter = 10;
sprintf("niter = %d", niter)
disp("CPU")
tic
for ii=1:niter
    % FDK with CPU filtering
    imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',false);
end
toc
disp("GPU")
tic
for ii=1:niter
    % FDK with GPU filtering
    imgFDK_GPU=FDK(projections,geo,angles,'usegpufft',true);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU(:,:,50), imgFDK_CPU(:,:,50)- imgFDK_GPU(:,:,50)]);
Discussion
This code takes more time than the CPU path even when the projections are 2048x2048.
- I found that https://github.com/CERN/TIGRE/blob/98fa77f163bde8f24421b2de901e4ccc2a813d99/Common/CUDA/FbpFiltration.cu#L109-L113 and https://github.com/CERN/TIGRE/blob/98fa77f163bde8f24421b2de901e4ccc2a813d99/Common/CUDA/FbpFiltration.cu#L158-L162 consume time, while the FFT and multiplying by the filter in Fourier space are fast enough. These copies are necessary as long as we use a C2C FFT. R2C and C2R may work.
- As for now, the CUDA code is common to MATLAB and Python because filtering.m transposes the matrix before calling fft.
- Only one GPU is used.
- Only one stream is used.

R2C and C2R may work.

Worked. (a39e6b2)
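To illustrate what the C2C → R2C/C2R change buys (a CPU-side sketch in numpy rather than cuFFT, with a |f| ramp as a stand-in filter): because the projections are real, a real-to-complex forward transform plus a complex-to-real inverse avoids packing and unpacking the data through a complex buffer, which is what cufftExecR2C/cufftExecC2R provide on the GPU side.

```python
import numpy as np

# Four zero-padded, real projection rows and a stand-in |f| ramp filter.
x = np.random.rand(4, 2048).astype(np.float32)
filt = np.abs(np.fft.fftfreq(2048)).astype(np.float32)

# C2C route: the real data has to be packed into (and unpacked from) a complex buffer.
y_c2c = np.real(np.fft.ifft(np.fft.fft(x, axis=1) * filt, axis=1))

# R2C/C2R route: half-spectrum transforms on the real data directly, no packing copies.
y_r2c = np.fft.irfft(np.fft.rfft(x, axis=1) * filt[:2048 // 2 + 1], axis=1, n=2048)

print(np.allclose(y_c2c, y_r2c, atol=1e-4))          # True, up to float rounding
```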
Results of python test code
Conditions
0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10
Before

CPU: 24.03577409999707 s
GPU: 28.770307099999627 s
RMSE fdk CPU: 526.5072514306678
RMSE fdk GPU: 526.5072768954007

After

CPU: 24.188547600002494 s
GPU: 12.808391700003995 s
RMSE fdk CPU: 526.5072514306678
RMSE fdk GPU: 526.5072259659336
@AnderBiguri Do you think we should keep the option to use or not use the GPU for FBP filtering in the filtering and FDK functions?
@tsadakane Fantastic work, honestly!
I am actually surprised that the GPU is not much faster, but the speedup achieved is still quite significant. Perhaps you are running on a CPU with many cores?
I will test this on my side on a couple of machines (old/new) and report speeds (sometime next week). I think if we can more or less verify that the filtering is always faster on the GPU, we can just remove the CPU option from FBP/FDK. We can always leave the MATLAB/Python code there, deprecated, so people can read what the CUDA part does in an easier form. What do you think?
@AnderBiguri Thank you for your reply. I think it would be better to remove the old code if possible. Let me know the result and I can remove the option before submitting this PR.
I am actually surprised that the GPU is not much faster, but the speedup achieved is still quite significant. Perhaps you are running on a CPU with many cores?
The CPU is i7-11700 (8 cores, 16 threads).
I found that only about 2 s of the 12.8 s on the GPU was spent creating and destroying the FFT "plan"s plus the FFT, multiplying by the filter, and the IFFT; another 7 s is consumed before and after calling the host function apply_filtration in FbpFiltration.cu.
@tsadakane fantastic, I think it's not a bad idea; it will live in git's history anyway, so you are right. I will test on a few different machines to ensure we can reliably say the GPU is faster all the time.
In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?
On my i7-7700HQ/GTX 1070 laptop, GPU is often slower:
MATLAB:
geo.nDetector 256^2, nangles=100, niter=10:
CPU Elapsed time is 6.132782 seconds. GPU Elapsed time is 8.440495 seconds.
geo.nDetector 512^2, nangles=100, niter=10:
CPU Elapsed time is 16.697335 seconds. GPU Elapsed time is 18.622791 seconds.
geo.nDetector 1024^2, nangles=100, niter=10:
CPU Elapsed time is 54.229799 seconds. GPU Elapsed time is 51.912400 seconds.
I still am quite surprised by this performance, I really did think it would be much faster. I'll try to dig a bit further, I am not sure if there is something extra we can optimize in the code that we may not be aware of, or if my assumption that it would be much faster is seriously flawed.
In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?
Sorry for my unclear English. I meant:
- 2 s is the FFT, multiplying by the filter, and the IFFT. (*)
- 7 s is Python<->CUDA interoperability stuff. (**)
- The rest is CUDA memory allocation and transfer.

(*) In apply_filtration, commenting out the lines that create and destroy the "plan" (what NVIDIA calls the FFT object) and the lines inside the curly brackets, the elapsed time dropped by 2 s, from 12 s to 10 s. Of those 2 s, 0.7 s is the FFT, multiplying by the filter, and the IFFT, and 1.4 s is creating/destroying the plan.

(**) Returning at the very beginning of apply_filtration, the time went from 12 s to 7 s.
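(For reference, a rough way to get a similar breakdown from the Python side is to time the transform stage in isolation. The numbers above were measured by commenting out code inside apply_filtration, not with this script; the sizes and the |f| ramp below are stand-ins.)

```python
import time
import numpy as np

# Stand-in sizes: 10 padded projections of 1024x2048 and a |f| ramp filter.
proj = np.random.rand(10, 1024, 2048).astype(np.float32)
filt = np.abs(np.fft.fftfreq(2048)).astype(np.float32)

t0 = time.perf_counter()
for p in proj:                                   # transform stage only: FFT, multiply, IFFT
    f = np.fft.rfft(p, axis=1)
    f *= filt[:2048 // 2 + 1]
    np.fft.irfft(f, axis=1, n=2048)
t1 = time.perf_counter()
print("FFT + filter + IFFT only: {:.2f} s".format(t1 - t0))
```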
@tsadakane I see, thanks. So if we were to try to optimize this we'd need to look at the python<->cuda interoperability. I need to sit down and read the code more carefully, as this seems surprisingly high? What do you think?
I have no idea so far.
- Commenting out all but the last line of the filtering function in filtering.py, which is called by FDK before Atb is called, the time became 4 s.
- In the filtering function in filtering.py, I modified the code as
    ...
    if use_gpu:
        bundle_size = 32
        fproj=np.empty((bundle_size, geo.nDetector[0],filt_len),dtype=np.float32)
        for idx_bundle in range(0,(angles.shape[0]+bundle_size-1)//bundle_size):
            fproj.fill(0)
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    fproj[k, :,padding:padding+geo.nDetector[1]]=proj[idx_bundle*bundle_size+k]
            fproj = fproj.reshape([bundle_size*geo.nDetector[0], filt_len])
            fproj = fbpfiltration.apply(fproj, filt, gpuids)
            fproj = fproj.reshape([bundle_size, geo.nDetector[0], filt_len])
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    proj[idx_bundle*bundle_size+k]=fproj[k, :,padding:padding+geo.nDetector[1]] * scale_factor
    else:
        ...
to reduce the number of CUDA function calls to 1/32, but the total time did not change (12 s).
I also wrote a multi-GPU version and a cudaMemcpyAsync version, but as expected, neither had any effect.
Just removing the padding from the filtering function in filtering.py reduced the time by 3 s, so I moved the padding into the CUDA function.
This reduces the amount of memory to transfer. For an efficient strided copy I used cudaMemcpy2D, and the GPU version is now three times faster than the CPU version.
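To make the data-volume argument concrete, a back-of-the-envelope sketch (sizes follow filtering.py's filt_len = 2**nextpow2(2*nDetector); the exact byte counts are illustrative): padding on the host roughly doubles what goes over PCIe, while padding on the device only transfers the original detector columns and lets cudaMemcpy2D place each row into the wider, zeroed buffer.

```python
import numpy as np

n_angles, n_rows, n_det = 100, 1024, 1024
filt_len = int(2 ** np.ceil(np.log2(2 * n_det)))            # 2048 for a 1024-wide detector

bytes_padded_on_host   = n_angles * n_rows * filt_len * 4   # float32 bytes, pad before upload
bytes_padded_on_device = n_angles * n_rows * n_det * 4      # upload raw rows, pad on the GPU

print(bytes_padded_on_host / 1e9)      # ~0.84 GB uploaded per filtering call
print(bytes_padded_on_device / 1e9)    # ~0.42 GB uploaded per filtering call
```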
0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10
CPU: 23.939920199998596
GPU: 7.977493500002311
RMSE fdk CPU:
526.5072514306678
RMSE fdk GPU:
526.5072514306678
filtering.py will then look like this:
def filtering(proj, geo, angles, parker, verbose=False, use_gpu=True, gpuids=None):
    if parker:
        proj=parkerweight(proj.transpose(0,2,1),geo,angles,parker).transpose(0,2,1)
    filt_len=max(64,2**nextpow2(2*max(geo.nDetector)))
    ramp_kernel=ramp_flat(filt_len)
    d=1
    filt=filter(geo.filter,ramp_kernel[0],filt_len,d,verbose=verbose)
    padding = int((filt_len-geo.nDetector[1])//2 )
    scale_factor = (geo.DSD[0]/geo.DSO[0]) * (2 * np.pi/ len(angles)) / ( 4 * geo.dDetector[0] )
    if use_gpu:
        bundle_size = 32 #len(gpuids)
        n_bundles = (angles.shape[0]+bundle_size-1)//bundle_size
        for idx_bundle in range(0,n_bundles):
            bundle_size_actual = bundle_size if idx_bundle != n_bundles-1 else angles.shape[0]-idx_bundle*bundle_size
            idx_begin = idx_bundle*bundle_size
            idx_end = idx_bundle*bundle_size+bundle_size_actual
            proj[idx_begin:idx_end] = fbpfiltration.apply2(proj, idx_begin, idx_end, filt, scale_factor, gpuids)
    else:
        filt=np.kron(np.ones((np.int64(geo.nDetector[0]),1),dtype=np.float32),filt)
        #filter 2 projection at a time packing in to complex container
        fproj=np.empty((geo.nDetector[0],filt_len),dtype=np.complex64)
        for i in range(0,angles.shape[0]-1,2):
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[i]
            fproj.imag[:,padding:padding+geo.nDetector[1]]=proj[i+1]
            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=ifft(fproj,axis=1)
            proj[i]=fproj.real[:,padding:padding+geo.nDetector[1]] * scale_factor
            proj[i+1]=fproj.imag[:,padding:padding+geo.nDetector[1]] * scale_factor
        #if odd number of projections filter last solo
        if angles.shape[0] % 2:
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[angles.shape[0]-1]
            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=np.real(ifft(fproj,axis=1))
            proj[angles.shape[0]-1]=fproj[:,padding:padding+geo.nDetector[1]] * scale_factor
    return proj
I'm worried that by moving the padding into CUDA, the use_gpu branch is no longer parallel to the CPU branch. What do you think?
I have not yet run this modification on MATLAB and have not yet committed it.
@tsadakane interesting!
Sorry, I am a bit busy this week, perhaps next week, so I don't have time to look at it with the detail it deserves. I intend to have a serious look at it as soon as I can.
About aa51ec6:
"CPU" uses the CPU for the convolution.
In "GPU 1", the convolution is processed on a single GPU, even if gpuids includes multiple GPUs. (= a39e6b2)
In "GPU 2", the padding and the convolution are processed on the GPUs.
Tests
Python
Test code
import numpy as np
import time
import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt
listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))
gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)
mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag # size of each pixel (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)
nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)
# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))
# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter
tic = timer()
for _ in range(niter):
    out_cpu = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=0)
toc = timer()
print("CPU : {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    out_gpu1 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=1)
toc = timer()
print("GPU 1: {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    out_gpu2 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=2)
toc = timer()
print("GPU 2: {}".format(toc-tic))
# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(out_cpu, head, ["nRMSE"]))
print("RMSE fdk GPU 1: ")
print(Measure_Quality(out_gpu1, head, ["nRMSE"]))
print("RMSE fdk GPU 2: ")
print(Measure_Quality(out_gpu2, head, ["nRMSE"]))
# Plot
fig, axes = plt.subplots(3, 5)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(out_cpu[geo.nVoxel[0] // 2])
axes[1, 0].imshow(out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(out_gpu1[geo.nVoxel[0] // 2])
axes[1, 1].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("FDK FFT on GPU 2")
axes[0, 2].imshow(out_gpu2[geo.nVoxel[0] // 2])
axes[1, 2].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2])
axes[0, 3].set_title("GPU1-CPU")
axes[0, 3].imshow(out_gpu1[geo.nVoxel[0] // 2] -out_cpu[geo.nVoxel[0] // 2])
axes[1, 3].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 3].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 4].set_title("GPU2-CPU")
axes[0, 4].imshow(out_gpu2[geo.nVoxel[0] // 2] -out_cpu[geo.nVoxel[0] // 2])
axes[1, 4].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 4].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
plt.show()
Result
Software | Version |
---|---|
Windows | 10 64bit |
Python(Intel) | 3.9.15 |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | 2x NVIDIA GeForce RTX 2080 Ti |
0: NVIDIA GeForce RTX 2080 Ti
1: NVIDIA GeForce RTX 2080 Ti
2: NVIDIA GeForce GTX 1070
{'name': 'NVIDIA GeForce RTX 2080 Ti', 'devices': [0, 1]}
proj size = (100, 1024, 1024)
niter = 10
[path to tigre working dir.]\Python\tigre\utilities\filtering.py:80: RuntimeWarning: divide by zero encountered in true_divide
h[odd] = -1 / (np.pi * nn[odd]) ** 2
CPU : 35.7683762
GPU 1: 19.560339300000003
GPU 2: 12.197235299999996
RMSE fdk CPU:
526.5050614590306
RMSE fdk GPU 1:
526.504934134818
RMSE fdk GPU 2:
526.504934134818
Matlab
Software | Version |
---|---|
Windows | 10 64bit |
MATLAB | 2022a |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | 2x NVIDIA GeForce RTX 2080 Ti |
Test code
clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256], "nDetector", [1024;1024]);
%% GPUs
gpuids = GpuIds('NVIDIA GeForce RTX 2080 Ti');
disp(gpuids);
disp(gpuids.devices);
%% Load data and generate projections
% define angles
angles=linspace(0,2*pi,100);
% Load head phantom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated', 'gpuids', gpuids);
sprintf("size of projections = %s", mat2str(size(projections)))
niter = 10;%10;
sprintf("niter = %d", niter)
disp("CPU")
tic
for ii=1:niter
    % FDK with CPU filtering
    imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',0, 'gpuids', gpuids);
end
toc
disp("GPU 1")
tic
for ii=1:niter
    % FDK with GPU filtering
    imgFDK_GPU1=FDK(projections,geo,angles,'usegpufft',1, 'gpuids', gpuids);
end
toc
disp("GPU 2")
tic
for ii=1:niter
    % FDK with GPU filtering
    imgFDK_GPU2=FDK(projections,geo,angles,'usegpufft',2, 'gpuids', gpuids);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU1(:,:,50), imgFDK_GPU2(:,:,50), abs(imgFDK_GPU1(:,:,50) - imgFDK_CPU(:,:,50)), abs(imgFDK_GPU2(:,:,50)- imgFDK_CPU(:,:,50))]);
%imshow([imgFDK_GPU(:,:,2)]);
Result
GpuIds with properties:
name: 'NVIDIA GeForce RTX 2080 Ti'
devices: [0 1]
0 1
ans =
"size of projections = [1024 1024 100]"
ans =
"niter = 10"
CPU
Elapsed time is 35.826591 seconds.
GPU 1
Elapsed time is 36.107104 seconds.
GPU 2
Elapsed time is 29.146955 seconds.
The table below summarizes the results above.
The Python result is acceptable to me, but the MATLAB result is hard to understand. At first I suspected that only one GPU was being used due to a bug, but when I checked, deviceCount in the apply_filtration2 function was 2.
 | Python [s] | Matlab [s] |
---|---|---|
CPU | 35.8 | 35.8 |
GPU 1 (Filtering only) | 19.6 | 36.1 |
GPU 2 (Padding and Filtering) | 12.2 | 29.1 |
Here, two GPUs are used.
I ran the Python test program above after adding gpuids.devices.pop(-1) to use only one GPU.
Python
 | 2x GPU [s] | 1x GPU [s] |
---|---|---|
CPU | 35.9 | 33.8 |
GPU 1 (Filtering only) | 19.2 | 16.8 |
GPU 2 (Padding and Filtering) | 12.1 | 10.4 |
It seems that the overhead of the copying is larger than the benefit of doubling the GPU resource.
@tsadakane that makes sense. Still surprised about the times, but I think it's my own false assumption of how fast it should be. Still really good :)
I am also a bit baffled by the MATLAB slowness, maybe something to do with extra copies? Let me check.
I think the MATLAB code (copies and) transposes the matrix; the fastest-varying axis in memory is v in MATLAB while in Python it is u, but in the CUDA code there is no difference between MATLAB and Python.
@tsadakane ah! We may then be able to do the same as we do with the other CUDA code, an #if IS_FOR_MATLAB_TIGRE kind of thing? I am not entirely sure where this is applied in the new CUDA code, happy to have a look if you want, but if you know where it happens in CUDA, feel free to do it yourself.
Thanks again, your work is really appreciated!
I can do it, but I am not sure it will work, because CUDA is not good at transposing.
It was not necessary to transpose the projections in Ax and Atb, because they are copied to the 3D texture, where there is no difference in memory-access efficiency between the u and v axes. On the other hand, the convolution must be applied along the u direction, so transposing the memory layout is necessary.
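The layout point can be seen with a CPU-side numpy sketch (illustrative; MATLAB is column-major, so its fastest-varying detector axis is the other one): an FFT along the contiguous last axis of a C-ordered array is often noticeably faster than along a strided axis, which is why either a transpose or a strided device copy is needed before filtering along u.

```python
import time
import numpy as np

a = np.random.rand(1024, 2048).astype(np.float32)    # C-ordered: the last axis is contiguous

t0 = time.perf_counter(); np.fft.rfft(a, axis=1); t1 = time.perf_counter()
t2 = time.perf_counter(); np.fft.rfft(a, axis=0); t3 = time.perf_counter()
print("contiguous axis: {:.4f} s, strided axis: {:.4f} s".format(t1 - t0, t3 - t2))

# An explicit transpose-and-copy (the role of the permute in filtering.m) makes the
# filtered axis contiguous again before the batched 1-D transforms.
b = np.ascontiguousarray(a.T)
t4 = time.perf_counter(); np.fft.rfft(b, axis=1); t5 = time.perf_counter()
print("after transpose + copy: {:.4f} s".format(t5 - t4))
```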
That said, from the point of view of symmetry, I think the permute line should be moved into the filtering function if possible, i.e. https://github.com/CERN/TIGRE/blob/f9a38edfa2eecabb724f1d9759eefdb041bbaf8d/MATLAB/Algorithms/FDK.m#L61-L77 =>
%% Weight
% proj=permute(proj,[2 1 3]); %%%%%%%%%% Delete
for ii=1:size(angles,2)
    us = ((-geo.nDetector(1)/2+0.5):1:(geo.nDetector(1)/2-0.5))*geo.dDetector(1) + offset(1,ii);
    vs = ((-geo.nDetector(2)/2+0.5):1:(geo.nDetector(2)/2-0.5))*geo.dDetector(2) + offset(2,ii);
    [uu,vv] = meshgrid(us,vs); % detector
    % Create weight according to each detector element
    w = (geo.DSD(ii))./sqrt((geo.DSD(ii))^2+uu.^2 + vv.^2);
    % Multiply the weights with projection data
    proj(:,:,ii) = w.*proj(:,:,ii); %%%%%%%%%% Transpose
end
%% Fourier transform based filtering
%proj=permute(proj,[2 1 3]); %%%%%%%%%% Add this line at the beginning of the filtering function.
proj = filtering(proj,geo,angles,parker); % Not sure if offsets are good in here
This modification makes the code a little more elegant, but the specification of the filtering function changes (the first argument proj is no longer transposed).
What do you think?
@tsadakane I see what you mean with the filtering. My convolution/FFT maths are a bit rusty, so I may be completely wrong, but isn't there a way to change the shape of the filter itself so that it is equivalent to transposing the projections? That would be the ideal solution, if possible, right?
About moving the transpose inside filtering, I 100% agree it is much more elegant.
Hi @tsadakane, sorry, I have been very busy with my life. How is this going? Do you need any help? Is it done?
Hi @AnderBiguri, I've been busy lately too. I've implemented the code to transpose the matrices, but I am not satisfied with the result and am trying several granularities of the processing passed to MEX/CUDA.
Apparently, giving the fft and ifft calls in Python the workers argument can accelerate them: fproj = fft(fproj, axis=1, workers=cores), with cores=os.cpu_count(). Haven't tried it myself.
Under the condition of
proj size = (100, 1024, 1024)
nVoxel = [256, 512, 512]
On 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
(8 physical cores + hyper-threading => cores=16) + GTX1060
0, CPU: Padding & FFT : 26.59433160000026 s @ 10 times
1, CPU: Multi-Workers : 15.725976400000036 s @ 10 times
Not bad.
I think the GPU implementation can be valuable too. I think the fastest FDK would be one that interleaves filtering and backprojection; I read a paper about it at some point.
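(A rough sketch of what that interleaving could look like at the Python level. This is hypothetical structure only: filter_chunk and backproject_chunk are placeholder helpers, not TIGRE APIs; since backprojection is linear in the projections, accumulating per-chunk backprojections gives the same volume.)

```python
def interleaved_fdk(proj, angles, filter_chunk, backproject_chunk, chunk=16):
    """Filter and backproject in angle chunks so the two stages can overlap.

    filter_chunk(p)         -> filtered chunk of projections   (placeholder)
    backproject_chunk(p, a) -> partial volume for those angles (placeholder)
    """
    vol = None
    for i in range(0, len(angles), chunk):
        p = filter_chunk(proj[i:i + chunk])              # e.g. on one CUDA stream...
        v = backproject_chunk(p, angles[i:i + chunk])    # ...while another backprojects
        vol = v if vol is None else vol + v              # partial volumes sum to the full one
    return vol
```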
Hi @tsadakane What do you think about all this? Worth merging it for the cases when it's faster?
Hi @AnderBiguri,
It's difficult to answer, but I'd like to say no.
- Offloading a lightweight task like the FFT is not worth the cost of copying back and forth between CPU and GPU (plus transposing if it is run from MATLAB).
- Using Python's fft workers argument can achieve speedups comparable to GPU offloading.
- TIGRE's flexibility would be impaired.
@tsadakane makes sense!
Flexibility is a bit less important because we can add a flag that defaults to CPU, but I think the rest of the points make total sense. Indeed, this would make sense if there were an FFT+backproject in the same CUDA call, but then the multi-GPU code gets complex.
It may be worth leaving the PR here (if you are happy with that), in case anyone in the future wants to use the code?