FBP filtering on GPU
Summary
Implemented FBP filtering on GPU using cuFFT.
See #312
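For reference, the step being moved to the GPU is the standard Fourier-domain ramp filtering of each projection row: zero-pad the row, FFT, multiply by the filter, inverse FFT, crop. A minimal CPU-side sketch in numpy (illustrative only; `filter_rows_cpu` is a hypothetical helper and the plain |f| ramp is a stand-in for the filter built in filtering.py — this PR implements the same operation with cuFFT):

```python
import numpy as np

def filter_rows_cpu(proj_2d, filt):
    """Ramp-filter each detector row of one projection in the Fourier domain."""
    n_rows, n_det = proj_2d.shape
    filt_len = filt.shape[0]                         # padded length, >= 2 * n_det
    pad = (filt_len - n_det) // 2
    padded = np.zeros((n_rows, filt_len), dtype=np.float32)
    padded[:, pad:pad + n_det] = proj_2d             # zero-pad around the detector data
    f = np.fft.fft(padded, axis=1)                   # forward FFT along the u axis
    f *= filt[np.newaxis, :]                         # multiply by the (real) filter
    out = np.real(np.fft.ifft(f, axis=1))            # back to the spatial domain
    return out[:, pad:pad + n_det]                   # crop the padding away

# Toy usage: one 256x256 projection and a |f| ramp as a stand-in filter.
proj = np.random.rand(256, 256).astype(np.float32)
filt = np.abs(np.fft.fftfreq(512)).astype(np.float32)
print(filter_rows_cpu(proj, filt).shape)             # (256, 256)
```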
Test
Condition
Software | Version |
---|---|
Windows | 10 64bit |
Python(anaconda) | 3.10.4 |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | GTX1060 6GB |
Software | Version |
---|---|
Windows | 10 64bit |
MATLAB | 2022a |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | GTX1060 6GB |
Python
import numpy as np
import time
import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt
listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))
gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)
mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag # size of each pixel (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)
nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)
# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))
# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter
tic = timer()
for _ in range(niter):
    fdkout = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=False)
toc = timer()
print("CPU: {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    ossart = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=True)
toc = timer()
print("GPU: {}".format(toc-tic))
# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(fdkout, head, ["nRMSE"]))
print("RMSE fdk GPU: ")
print(Measure_Quality(ossart, head, ["nRMSE"]))
# Plot
fig, axes = plt.subplots(3, 3)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(fdkout[geo.nVoxel[0] // 2])
axes[1, 0].imshow(fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(fdkout[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(ossart[geo.nVoxel[0] // 2])
axes[1, 1].imshow(ossart[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(ossart[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("GPU-CPU")
axes[0, 2].imshow(ossart[geo.nVoxel[0] // 2] -fdkout[geo.nVoxel[0] // 2])
axes[1, 2].imshow(ossart[:, geo.nVoxel[1] // 2, :]-fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(ossart[:, :, geo.nVoxel[2] // 2]-fdkout[:, :, geo.nVoxel[2] // 2])
plt.show()
# tigre.plotProj(proj)
# tigre.plotImg(fdkout)
Matlab
clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256]);
%% Load data and generate projections
% define angles
angles=linspace(0,2*pi,100);
% Load head phantom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated');
sprintf("size of projections = %s", mat2str(size(projections)))
niter = 10;
sprintf("niter = %d", niter)
disp("CPU")
tic
for ii=1:niter
    % FDK with CPU filtering
    imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',false);
end
toc
disp("GPU")
tic
for ii=1:niter
    % FDK with GPU filtering
    imgFDK_GPU=FDK(projections,geo,angles,'usegpufft',true);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU(:,:,50), imgFDK_CPU(:,:,50)- imgFDK_GPU(:,:,50)]);
Discussion
This code takes more time than the CPU path even when the projections are 2048x2048.
- I found that https://github.com/CERN/TIGRE/blob/98fa77f163bde8f24421b2de901e4ccc2a813d99/Common/CUDA/FbpFiltration.cu#L109-L113 and https://github.com/CERN/TIGRE/blob/98fa77f163bde8f24421b2de901e4ccc2a813d99/Common/CUDA/FbpFiltration.cu#L158-L162 consume time, while the FFT and multiplying by the filter in Fourier space are fast enough. These copies are necessary as long as we use a C2C FFT. R2C and C2R may work.
- As for now, the CUDA code is common to MATLAB and Python because filtering.m transposes the matrix before calling fft.
- Only one GPU is used.
- Only one stream is used.

R2C and C2R may work.

Worked. (a39e6b2)
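To illustrate what the C2C → R2C/C2R change buys (a CPU-side sketch in numpy rather than cuFFT, with a |f| ramp as a stand-in filter): because the projections are real, a real-to-complex forward transform plus a complex-to-real inverse avoids packing and unpacking the data through a complex buffer, which is what cufftExecR2C/cufftExecC2R provide on the GPU side.

```python
import numpy as np

# Four zero-padded, real projection rows and a stand-in |f| ramp filter.
x = np.random.rand(4, 2048).astype(np.float32)
filt = np.abs(np.fft.fftfreq(2048)).astype(np.float32)

# C2C route: the real data has to be packed into (and unpacked from) a complex buffer.
y_c2c = np.real(np.fft.ifft(np.fft.fft(x, axis=1) * filt, axis=1))

# R2C/C2R route: half-spectrum transforms on the real data directly, no packing copies.
y_r2c = np.fft.irfft(np.fft.rfft(x, axis=1) * filt[:2048 // 2 + 1], axis=1, n=2048)

print(np.allclose(y_c2c, y_r2c, atol=1e-4))          # True, up to float rounding
```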
Results of python test code
Conditions
0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10
Before

CPU: 24.03577409999707 s
GPU: 28.770307099999627 s
RMSE fdk CPU: 526.5072514306678
RMSE fdk GPU: 526.5072768954007

After

CPU: 24.188547600002494 s
GPU: 12.808391700003995 s
RMSE fdk CPU: 526.5072514306678
RMSE fdk GPU: 526.5072259659336
@AnderBiguri Do you think we should keep the option to use or not use the GPU for FBP filtering in the filtering and FDK functions?
@tsadakane Fantastic work, honestly!
I am actually surprised that the GPU is not much faster, but the speedup achieved is still quite significant. Perhaps you are running on a CPU with many cores?
I will test this on my side on a couple of machines (old/new) and report speeds (sometime next week). I think if we can more or less verify that the filtering is always faster on the GPU, we can just remove the CPU option from FBP/FDK. We can always leave the MATLAB/Python code there, deprecated, so people can read what the CUDA part does in an easier form. What do you think?
@AnderBiguri Thank you for your reply. I think it would be better to remove the old code if possible. Let me know the result and I can remove the option before submitting this PR.
I am actually surprised that the GPU is not much faster, but the speedup achieved is still quite significant. Perhaps you are running on a CPU with many cores?
The CPU is i7-11700 (8 cores, 16 threads).
I found that only about 2 s of the 12.8 s on the GPU was spent creating and destroying the FFT "plan"s plus the FFT, multiplying by the filter, and the IFFT; another 7 s is consumed before and after calling the host function apply_filtration in FbpFiltration.cu.
@tsadakane fantastic, I think it's not a bad idea; it will live in git's history anyway, so you are right. I will test on a few different machines to ensure we can reliably say the GPU is faster all the time.
In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?
On my i7-7700HQ/GTX 1070 laptop, GPU is often slower:
MATLAB:
geo.nDetector 256^2, nangles=100, niter=10:
CPU Elapsed time is 6.132782 seconds. GPU Elapsed time is 8.440495 seconds.
geo.nDetector 512^2, nangles=100, niter=10:
CPU Elapsed time is 16.697335 seconds. GPU Elapsed time is 18.622791 seconds.
geo.nDetector 1024^2, nangles=100, niter=10:
CPU Elapsed time is 54.229799 seconds. GPU Elapsed time is 51.912400 seconds.
I still am quite surprised by this performance, I really did think it would be much faster. I'll try to dig a bit further, I am not sure if there is something extra we can optimize in the code that we may not be aware of, or if my assumption that it would be much faster is seriously flawed.
In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?
Sorry for my unclear English. I meant:
- 2 s is the FFT, multiplying by the filter, and the IFFT. (*)
- 7 s is Python<->CUDA interoperability stuff. (**)
- The rest is CUDA memory allocation and transfer.

(*) In apply_filtration, commenting out the lines that create and destroy the "plan" (what NVIDIA calls the FFT object) and the lines inside the curly brackets, the elapsed time dropped by 2 s, from 12 s to 10 s. Of those 2 s, 0.7 s is the FFT, multiplying by the filter, and the IFFT, and 1.4 s is creating/destroying the plan.

(**) Returning at the very beginning of apply_filtration, the time went from 12 s to 7 s.
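(For reference, a rough way to get a similar breakdown from the Python side is to time the transform stage in isolation. The numbers above were measured by commenting out code inside apply_filtration, not with this script; the sizes and the |f| ramp below are stand-ins.)

```python
import time
import numpy as np

# Stand-in sizes: 10 padded projections of 1024x2048 and a |f| ramp filter.
proj = np.random.rand(10, 1024, 2048).astype(np.float32)
filt = np.abs(np.fft.fftfreq(2048)).astype(np.float32)

t0 = time.perf_counter()
for p in proj:                                   # transform stage only: FFT, multiply, IFFT
    f = np.fft.rfft(p, axis=1)
    f *= filt[:2048 // 2 + 1]
    np.fft.irfft(f, axis=1, n=2048)
t1 = time.perf_counter()
print("FFT + filter + IFFT only: {:.2f} s".format(t1 - t0))
```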
@tsadakane I see, thanks. So if we were to try to optimize this we'd need to look at the python<->cuda interoperability. I need to sit down and read the code more carefully, as this seems surprisingly high? What do you think?
I have no idea so far.
- Commenting out all but the last line of the filtering function in filtering.py, which is called by FDK before Atb is called, the time became 4 s.
- In the filtering function in filtering.py, I modified the code as
    ...
    if use_gpu:
        bundle_size = 32
        fproj=np.empty((bundle_size, geo.nDetector[0],filt_len),dtype=np.float32)
        for idx_bundle in range(0,(angles.shape[0]+bundle_size-1)//bundle_size):
            fproj.fill(0)
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    fproj[k, :,padding:padding+geo.nDetector[1]]=proj[idx_bundle*bundle_size+k]
            fproj = fproj.reshape([bundle_size*geo.nDetector[0], filt_len])
            fproj = fbpfiltration.apply(fproj, filt, gpuids)
            fproj = fproj.reshape([bundle_size, geo.nDetector[0], filt_len])
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    proj[idx_bundle*bundle_size+k]=fproj[k, :,padding:padding+geo.nDetector[1]] * scale_factor
    else:
        ...
to reduce the number of CUDA function calls to 1/32, but the total time did not change (12 s).
I also wrote a multi-GPU version and a cudaMemcpyAsync version, but as expected, neither had any effect.
Just removing the padding from the filtering function in filtering.py reduced the time by 3 s, so I moved the padding into the CUDA function.
This reduces the amount of memory to transfer. For an efficient strided copy I used cudaMemcpy2D, and the GPU version is now three times faster than the CPU version.
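To make the data-volume argument concrete, a back-of-the-envelope sketch (sizes follow filtering.py's filt_len = 2**nextpow2(2*nDetector); the exact byte counts are illustrative): padding on the host roughly doubles what goes over PCIe, while padding on the device only transfers the original detector columns and lets cudaMemcpy2D place each row into the wider, zeroed buffer.

```python
import numpy as np

n_angles, n_rows, n_det = 100, 1024, 1024
filt_len = int(2 ** np.ceil(np.log2(2 * n_det)))            # 2048 for a 1024-wide detector

bytes_padded_on_host   = n_angles * n_rows * filt_len * 4   # float32 bytes, pad before upload
bytes_padded_on_device = n_angles * n_rows * n_det * 4      # upload raw rows, pad on the GPU

print(bytes_padded_on_host / 1e9)      # ~0.84 GB uploaded per filtering call
print(bytes_padded_on_device / 1e9)    # ~0.42 GB uploaded per filtering call
```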
0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10
CPU: 23.939920199998596
GPU: 7.977493500002311
RMSE fdk CPU:
526.5072514306678
RMSE fdk GPU:
526.5072514306678
filtering.py will then look like this:
def filtering(proj, geo, angles, parker, verbose=False, use_gpu=True, gpuids=None):
    if parker:
        proj=parkerweight(proj.transpose(0,2,1),geo,angles,parker).transpose(0,2,1)
    filt_len=max(64,2**nextpow2(2*max(geo.nDetector)))
    ramp_kernel=ramp_flat(filt_len)
    d=1
    filt=filter(geo.filter,ramp_kernel[0],filt_len,d,verbose=verbose)
    padding = int((filt_len-geo.nDetector[1])//2 )
    scale_factor = (geo.DSD[0]/geo.DSO[0]) * (2 * np.pi/ len(angles)) / ( 4 * geo.dDetector[0] )
    if use_gpu:
        bundle_size = 32 #len(gpuids)
        n_bundles = (angles.shape[0]+bundle_size-1)//bundle_size
        for idx_bundle in range(0,n_bundles):
            bundle_size_actual = bundle_size if idx_bundle != n_bundles-1 else angles.shape[0]-idx_bundle*bundle_size
            idx_begin = idx_bundle*bundle_size
            idx_end = idx_bundle*bundle_size+bundle_size_actual
            proj[idx_begin:idx_end] = fbpfiltration.apply2(proj, idx_begin, idx_end, filt, scale_factor, gpuids)
    else:
        filt=np.kron(np.ones((np.int64(geo.nDetector[0]),1),dtype=np.float32),filt)
        #filter 2 projection at a time packing in to complex container
        fproj=np.empty((geo.nDetector[0],filt_len),dtype=np.complex64)
        for i in range(0,angles.shape[0]-1,2):
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[i]
            fproj.imag[:,padding:padding+geo.nDetector[1]]=proj[i+1]
            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=ifft(fproj,axis=1)
            proj[i]=fproj.real[:,padding:padding+geo.nDetector[1]] * scale_factor
            proj[i+1]=fproj.imag[:,padding:padding+geo.nDetector[1]] * scale_factor
        #if odd number of projections filter last solo
        if angles.shape[0] % 2:
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[angles.shape[0]-1]
            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=np.real(ifft(fproj,axis=1))
            proj[angles.shape[0]-1]=fproj[:,padding:padding+geo.nDetector[1]] * scale_factor
    return proj
I'm worried that by moving the padding into CUDA, the use_gpu branch is no longer parallel to the CPU branch. What do you think?
I have not yet run this modification on MATLAB and have not yet committed it.
@tsadakane interesting!
Sorry, I am a bit busy this week, perhaps next week, so I don't have time to look at it with the detail it deserves. I intend to have a serious look at it as soon as I can.
About aa51ec6:
"CPU" uses the CPU for the convolution.
In "GPU 1", the convolution is processed on a single GPU, even if gpuids includes multiple GPUs. (= a39e6b2)
In "GPU 2", the padding and the convolution are processed on the GPUs.
Tests
Python
Test code
import numpy as np
import time
import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt
listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))
gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)
mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag # size of each pixel (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)
nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)
# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))
# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter
tic = timer()
for _ in range(niter):
    out_cpu = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=0)
toc = timer()
print("CPU : {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    out_gpu1 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=1)
toc = timer()
print("GPU 1: {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    out_gpu2 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=2)
toc = timer()
print("GPU 2: {}".format(toc-tic))
# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(out_cpu, head, ["nRMSE"]))
print("RMSE fdk GPU 1: ")
print(Measure_Quality(out_gpu1, head, ["nRMSE"]))
print("RMSE fdk GPU 2: ")
print(Measure_Quality(out_gpu2, head, ["nRMSE"]))
# Plot
fig, axes = plt.subplots(3, 5)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(out_cpu[geo.nVoxel[0] // 2])
axes[1, 0].imshow(out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(out_gpu1[geo.nVoxel[0] // 2])
axes[1, 1].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("FDK FFT on GPU 2")
axes[0, 2].imshow(out_gpu2[geo.nVoxel[0] // 2])
axes[1, 2].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2])
axes[0, 3].set_title("GPU1-CPU")
axes[0, 3].imshow(out_gpu1[geo.nVoxel[0] // 2] -out_cpu[geo.nVoxel[0] // 2])
axes[1, 3].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 3].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 4].set_title("GPU2-CPU")
axes[0, 4].imshow(out_gpu2[geo.nVoxel[0] // 2] -out_cpu[geo.nVoxel[0] // 2])
axes[1, 4].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 4].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
plt.show()
Result
Software | Version |
---|---|
Windows | 10 64bit |
Python(Intel) | 3.9.15 |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | 2x NVIDIA GeForce RTX 2080 Ti |
0: NVIDIA GeForce RTX 2080 Ti
1: NVIDIA GeForce RTX 2080 Ti
2: NVIDIA GeForce GTX 1070
{'name': 'NVIDIA GeForce RTX 2080 Ti', 'devices': [0, 1]}
proj size = (100, 1024, 1024)
niter = 10
[path to tigre working dir.]\Python\tigre\utilities\filtering.py:80: RuntimeWarning: divide by zero encountered in true_divide
h[odd] = -1 / (np.pi * nn[odd]) ** 2
CPU : 35.7683762
GPU 1: 19.560339300000003
GPU 2: 12.197235299999996
RMSE fdk CPU:
526.5050614590306
RMSE fdk GPU 1:
526.504934134818
RMSE fdk GPU 2:
526.504934134818
Matlab
Software | Version |
---|---|
Windows | 10 64bit |
MATLAB | 2022a |
CUDA | 11.8 |
Visual Studio | 2022 |
GPU | 2x NVIDIA GeForce RTX 2080 Ti |
Test code
clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256], "nDetector", [1024;1024]);
%% GPUs
gpuids = GpuIds('NVIDIA GeForce RTX 2080 Ti');
disp(gpuids);
disp(gpuids.devices);
%% Load data and generate projections
% define angles
angles=linspace(0,2*pi,100);
% Load head phantom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated', 'gpuids', gpuids);
sprintf("size of projections = %s", mat2str(size(projections)))
niter = 10;%10;
sprintf("niter = %d", niter)
disp("CPU")
tic
for ii=1:niter
    % FDK with CPU filtering
    imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',0, 'gpuids', gpuids);
end
toc
disp("GPU 1")
tic
for ii=1:niter
    % FDK with GPU filtering
    imgFDK_GPU1=FDK(projections,geo,angles,'usegpufft',1, 'gpuids', gpuids);
end
toc
disp("GPU 2")
tic
for ii=1:niter
    % FDK with GPU filtering
    imgFDK_GPU2=FDK(projections,geo,angles,'usegpufft',2, 'gpuids', gpuids);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU1(:,:,50), imgFDK_GPU2(:,:,50), abs(imgFDK_GPU1(:,:,50) - imgFDK_CPU(:,:,50)), abs(imgFDK_GPU2(:,:,50)- imgFDK_CPU(:,:,50))]);
%imshow([imgFDK_GPU(:,:,2)]);
Result
GpuIds with properties:
name: 'NVIDIA GeForce RTX 2080 Ti'
devices: [0 1]
0 1
ans =
"size of projections = [1024 1024 100]"
ans =
"niter = 10"
CPU
Elapsed time is 35.826591 seconds.
GPU 1
Elapsed time is 36.107104 seconds.
GPU 2
Elapsed time is 29.146955 seconds.
The table below summarizes the results above.
The Python result is acceptable to me, but the MATLAB result is hard to understand. At first I suspected that only one GPU was being used due to a bug, but when I checked, deviceCount in the apply_filtration2 function was 2.
 | Python [s] | Matlab [s] |
---|---|---|
CPU | 35.8 | 35.8 |
GPU 1 (Filtering only) | 19.6 | 36.1 |
GPU 2 (Padding and Filtering) | 12.2 | 29.1 |
Here, two GPUs are used.
I ran the Python test program above after adding gpuids.devices.pop(-1) to use only one GPU.
Python
 | 2x GPU [s] | 1x GPU [s] |
---|---|---|
CPU | 35.9 | 33.8 |
GPU 1 (Filtering only) | 19.2 | 16.8 |
GPU 2 (Padding and Filtering) | 12.1 | 10.4 |
It seems that the overhead of the copying is larger than the benefit of doubling the GPU resource.
@tsadakane that makes sense. Still surprised about the times, but I think it's my own false assumption of how fast it should be. Still really good :)
I am also a bit baffled by the MATLAB slowness, maybe something to do with extra copies? Let me check.
I think the MATLAB code (copies and) transposes the matrix; the fastest-varying axis in memory is v in MATLAB while in Python it is u, but in the CUDA code there is no difference between MATLAB and Python.
@tsadakane ah! We may then be able to do the same as we do with the other CUDA code, an #if IS_FOR_MATLAB_TIGRE kind of thing? I am not entirely sure where this is applied in the new CUDA code, happy to have a look if you want, but if you know where it happens in CUDA, feel free to do it yourself.
Thanks again, your work is really appreciated!
I can do it, but I am not sure it will work, because CUDA is not good at transposing.
It was not necessary to transpose the projections in Ax and Atb, because they are copied to the 3D texture, where there is no difference in memory-access efficiency between the u and v axes. On the other hand, the convolution must be applied along the u direction, so transposing the memory layout is necessary.
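The layout point can be seen with a CPU-side numpy sketch (illustrative; MATLAB is column-major, so its fastest-varying detector axis is the other one): an FFT along the contiguous last axis of a C-ordered array is often noticeably faster than along a strided axis, which is why either a transpose or a strided device copy is needed before filtering along u.

```python
import time
import numpy as np

a = np.random.rand(1024, 2048).astype(np.float32)    # C-ordered: the last axis is contiguous

t0 = time.perf_counter(); np.fft.rfft(a, axis=1); t1 = time.perf_counter()
t2 = time.perf_counter(); np.fft.rfft(a, axis=0); t3 = time.perf_counter()
print("contiguous axis: {:.4f} s, strided axis: {:.4f} s".format(t1 - t0, t3 - t2))

# An explicit transpose-and-copy (the role of the permute in filtering.m) makes the
# filtered axis contiguous again before the batched 1-D transforms.
b = np.ascontiguousarray(a.T)
t4 = time.perf_counter(); np.fft.rfft(b, axis=1); t5 = time.perf_counter()
print("after transpose + copy: {:.4f} s".format(t5 - t4))
```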
That said, from the point of view of symmetry, I think the permute line should be moved into the filtering function if possible, i.e. https://github.com/CERN/TIGRE/blob/f9a38edfa2eecabb724f1d9759eefdb041bbaf8d/MATLAB/Algorithms/FDK.m#L61-L77 =>
%% Weight
% proj=permute(proj,[2 1 3]); %%%%%%%%%% Delete
for ii=1:size(angles,2)
    us = ((-geo.nDetector(1)/2+0.5):1:(geo.nDetector(1)/2-0.5))*geo.dDetector(1) + offset(1,ii);
    vs = ((-geo.nDetector(2)/2+0.5):1:(geo.nDetector(2)/2-0.5))*geo.dDetector(2) + offset(2,ii);
    [uu,vv] = meshgrid(us,vs); % detector
    % Create weight according to each detector element
    w = (geo.DSD(ii))./sqrt((geo.DSD(ii))^2+uu.^2 + vv.^2);
    % Multiply the weights with projection data
    proj(:,:,ii) = w.*proj(:,:,ii); %%%%%%%%%% Transpose
end
%% Fourier transform based filtering
%proj=permute(proj,[2 1 3]); %%%%%%%%%% Add this line at the beginning of the filtering function.
proj = filtering(proj,geo,angles,parker); % Not sure if offsets are good in here
This modification makes the code a little more elegant, but the specification of the filtering function changes (the first argument proj is no longer transposed).
What do you think?
@tsadakane I see what you mean with the filtering. My convolution/FFT maths are a bit rusty, so I may be completely wrong, but isn't there a way to change the shape of the filter itself so that it is equivalent to transposing the projections? That would be the ideal solution, if possible, right?
About moving the transpose inside filtering, I 100% agree it is much more elegant.
Hi @tsadakane, sorry, I have been very busy with my life. How is this going? Do you need any help? Is it done?
Hi @AnderBiguri, I've been busy lately too. I've implemented the code to transpose the matrices, but I am not satisfied with the result and am trying several granularities of the processing passed to MEX/CUDA.
Apparently, giving the fft and ifft calls in Python the workers argument can accelerate them: fproj = fft(fproj, axis=1, workers=cores), with cores=os.cpu_count(). Haven't tried it myself.
Under the condition of
proj size = (100, 1024, 1024)
nVoxel = [256, 512, 512]
On 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
(8 physical cores + hyper-threading => cores=16) + GTX1060
0, CPU: Padding & FFT : 26.59433160000026 s @ 10 times
1, CPU: Multi-Workers : 15.725976400000036 s @ 10 times
Not bad.
I think the GPU implementation can be valuable too. I think the fastest FDK would be one that interleaves filtering and backprojection; I read a paper about it at some point.
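(A rough sketch of what that interleaving could look like at the Python level. This is hypothetical structure only: filter_chunk and backproject_chunk are placeholder helpers, not TIGRE APIs; since backprojection is linear in the projections, accumulating per-chunk backprojections gives the same volume.)

```python
def interleaved_fdk(proj, angles, filter_chunk, backproject_chunk, chunk=16):
    """Filter and backproject in angle chunks so the two stages can overlap.

    filter_chunk(p)         -> filtered chunk of projections   (placeholder)
    backproject_chunk(p, a) -> partial volume for those angles (placeholder)
    """
    vol = None
    for i in range(0, len(angles), chunk):
        p = filter_chunk(proj[i:i + chunk])              # e.g. on one CUDA stream...
        v = backproject_chunk(p, angles[i:i + chunk])    # ...while another backprojects
        vol = v if vol is None else vol + v              # partial volumes sum to the full one
    return vol
```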
Hi @tsadakane What do you think about all this? Worth merging it for the cases when it's faster?
Hi @AnderBiguri,
It's difficult to answer, but I'd like to say no.
- Offloading a lightweight task like the FFT is not worth the cost of copying back and forth between CPU and GPU (plus transposing if it is run from MATLAB).
- Using Python's fft workers argument can achieve speedups comparable to GPU offloading.
- TIGRE's flexibility would be impaired.
@tsadakane makes sense!
Flexibility is a bit less important because we can add a flag that defaults to CPU, but I think the rest of the points make total sense. Indeed, this would make sense if there were an FFT+backproject in the same CUDA call, but then the multi-GPU code gets complex.
It may be worth leaving the PR here (if you are happy with that), in case anyone in the future wants to use the code?