Lattice_Lightsheet_Deskew_Deconv
Performance in batch mode
For this particular dataset, the naive approach takes about 1 s per image, including read/write. I can see the GPU utilization going up and down as well.

In my command-line batch tool, processing that same dataset is almost a factor of 4 slower:

That code does an additional affine transform and MIP, but that should not make a significant difference. Maybe the overhead is due to passing around partially evaluated functions.
Partial function evaluation does indeed incur a performance penalty (https://stackoverflow.com/questions/17388438/python-functools-partial-efficiency). Still, the slowdown is almost a factor of 4, which is difficult to explain by function call overhead alone.
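The per-call cost of functools.partial can be sanity-checked with a micro-benchmark along these lines (a sketch with a dummy function, not the actual pipeline code):

```python
import timeit
from functools import partial

def process(volume, angle, dz):
    # trivial stand-in for the real per-volume processing
    return volume + angle + dz

p = partial(process, angle=31.8, dz=0.3)

n = 1_000_000
direct = timeit.timeit(lambda: process(1.0, angle=31.8, dz=0.3), number=n)
wrapped = timeit.timeit(lambda: p(1.0), number=n)
print(f"direct: {direct:.2f} s, partial: {wrapped:.2f} s for {n} calls")
```

The difference should be on the order of a fraction of a microsecond per call, far too small to account for a 4x slowdown over a batch of ~100 volumes.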
Profiling with cProfile. This is for deconvolving 100 volumes with 10 iterations each. The actual deconvolution takes about 0.75 s per frame. The next most expensive item is a pyopencl call (probably related to deskew/rotate) taking nearly 0.4 s per call. np.astype takes up a considerable amount of processing time, as do reading and writing. Some of this could probably be parallelized.
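For reference, the stats below were gathered roughly like this (a sketch; run_batch and its argument are placeholders for the actual batch entry point):

```python
import cProfile
import pstats

# run the whole batch under the profiler and dump the raw stats to disk
cProfile.run("run_batch(experiment_folder)", "process_stats")  # run_batch is a placeholder

# show the hot spots, sorted by internal time as in the listing below
pstats.Stats("process_stats").sort_stats("tottime").print_stats(15)
```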
```
Fri Mar 15 11:57:54 2019 process_stats
2275854 function calls (2263132 primitive calls) in 347.478 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100 75.846 0.758 75.846 0.758 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
101 39.725 0.393 39.725 0.393 {built-in method pyopencl._cl._enqueue_read_buffer}
227 36.042 0.159 36.042 0.159 {built-in method numpy.core.multiarray.concatenate}
922 34.942 0.038 34.942 0.038 {method 'astype' of 'numpy.ndarray' objects}
101 32.304 0.320 32.305 0.320 /home/vhil0002/anaconda3/envs/newllsm/lib/python3.6/site-packages/pyopencl/__init__.py:872(image_init)
10252 20.804 0.002 20.804 0.002 {method 'readinto' of '_io.BufferedReader' objects}
101 20.215 0.200 20.215 0.200 {method 'tofile' of 'numpy.ndarray' objects}
217 19.790 0.091 19.790 0.091 {built-in method numpy.core.multiarray.copyto}
101 13.144 0.130 13.144 0.130 {method 'clip' of 'numpy.ndarray' objects}
101 12.960 0.128 12.960 0.128 {built-in method pyopencl._cl.enqueue_nd_range_kernel}
100 8.832 0.088 343.422 3.434 /home/vhil0002/anaconda3/envs/newllsm/lib/python3.6/site-packages/l
```
I have run similar profiling for the gputools RL deconvolution. There, most of the time is spent in np.astype.
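One general observation: ndarray.astype always allocates and copies, even when the array is already in the target dtype, so redundant conversions can be skipped cheaply. A minimal illustration of the idiom (not code from the pipeline):

```python
import numpy as np

vol = np.random.rand(64, 256, 256).astype(np.float32)

a = vol.astype(np.float32)                                       # always allocates and copies
b = vol if vol.dtype == np.float32 else vol.astype(np.float32)   # convert only when needed
c = vol.astype(np.float32, copy=False)                           # numpy skips the copy if nothing changes
```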
Some more comments about gputools deconvolve (separate from flowdec). There are some obvious improvements that can be made: for example, an FFT plan is calculated in the gputools implementation but never used, and the PSF is pre-processed and sent to the GPU each time the deconvolution is called. Separating the deconvolution into an init and a run step would allow processing the PSF once and leaving it on the GPU, as sketched below.
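A rough shape for that split could look like the following. This is a structural sketch only, not the gputools API: np.fft stands in for the GPU FFT, and in a real implementation the plan, the transformed PSF and the temporary buffers would live on the GPU.

```python
import numpy as np

class RLDeconvolver:
    """Richardson-Lucy deconvolution split into a one-off init and a per-volume run (CPU sketch)."""

    def __init__(self, psf, shape, n_iter=10):
        self.n_iter = n_iter
        self.shape = shape
        # pre-process the PSF once: normalize, pad to volume shape, center at the origin, transform
        psf = psf / psf.sum()
        psf_padded = np.zeros(shape, dtype=np.float32)
        psf_padded[tuple(slice(0, s) for s in psf.shape)] = psf
        psf_padded = np.roll(psf_padded, [-(s // 2) for s in psf.shape], axis=(0, 1, 2))
        self._otf = np.fft.rfftn(psf_padded)
        self._otf_conj = np.conj(self._otf)

    def run(self, volume):
        # only per-volume work happens here; the pre-processed PSF is reused as-is
        estimate = np.full(self.shape, volume.mean(), dtype=np.float32)
        for _ in range(self.n_iter):
            blurred = np.fft.irfftn(np.fft.rfftn(estimate) * self._otf, s=self.shape)
            ratio = volume / np.maximum(blurred, 1e-12)
            estimate *= np.fft.irfftn(np.fft.rfftn(ratio) * self._otf_conj, s=self.shape)
        return estimate
```

With this split, the PSF padding, normalization and forward transform run once per batch, and only run() is called per volume.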
TODO: check whether flowdec sends the PSF to the GPU each time (I believe it does). Maybe that can be optimized as well.
Rewrote maweigert's gputools-based deconvolution to reuse the FFT plan, the pre-processed PSF (which remains in GPU RAM) and the temporary GPU buffers. Removed an unnecessary duplicate .astype(np.complex64).
See https://github.com/VolkerH/Lattice_Lightsheet_Deskew_Deconv/blob/benchmarking/lls_dd/deconv_gputools_rewrite.py
Major speed improvement. The actual deconvolution is now much faster than the overall time per volume, so disk reads/writes will have to happen in separate threads (see the sketch below).
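One way to overlap disk I/O with GPU work is a small read-ahead/write-behind setup. The sketch below assumes placeholder read_volume/write_volume helpers and an init/run-style deconvolver like the one sketched above; it is not the code in the repository.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def process_batch(in_files, out_files, deconvolver, read_volume, write_volume):
    """Prefetch reads and queue writes in worker threads so the GPU is not idle during I/O."""
    results = Queue()

    def writer():
        # drain the result queue and write volumes out until the sentinel arrives
        while True:
            item = results.get()
            if item is None:
                break
            path, data = item
            write_volume(path, data)

    with ThreadPoolExecutor(max_workers=2) as pool:
        writer_done = pool.submit(writer)
        next_vol = pool.submit(read_volume, in_files[0])  # read ahead
        for i, out_path in enumerate(out_files):
            vol = next_vol.result()
            if i + 1 < len(in_files):
                next_vol = pool.submit(read_volume, in_files[i + 1])
            # GPU work happens here while the next read and pending writes run in threads
            results.put((out_path, deconvolver.run(vol)))
        results.put(None)  # tell the writer thread to finish
        writer_done.result()
```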

cProfile stats for the above three runs:
gputools rewrite:
gputools:
flowdec:
