Lattice_Lightsheet_Deskew_Deconv
Performance in batch mode
For this particular dataset, the naive approach takes about 1 s per image, including read/write. I can see the GPU utilization going up and down as well.

In my command-line batch tool, processing that same dataset is almost a factor of 4 slower:

That code does an additional affine transform and MIP, but that should not make a significant difference. Maybe the overhead is due to passing around partially evaluated functions.
Partial function evaluation does indeed incur a performance penalty (https://stackoverflow.com/questions/17388438/python-functools-partial-efficiency). Still, the slowdown is almost a factor of 4, which is difficult to explain by function call overhead alone.
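The per-call cost of functools.partial can be sanity-checked with a micro-benchmark along these lines (a sketch with a dummy function, not the actual pipeline code):

```python
import timeit
from functools import partial

def process(volume, angle, dz):
    # trivial stand-in for the real per-volume processing
    return volume + angle + dz

p = partial(process, angle=31.8, dz=0.3)

n = 1_000_000
direct = timeit.timeit(lambda: process(1.0, angle=31.8, dz=0.3), number=n)
wrapped = timeit.timeit(lambda: p(1.0), number=n)
print(f"direct: {direct:.2f} s, partial: {wrapped:.2f} s for {n} calls")
```

The difference should be on the order of a fraction of a microsecond per call, far too small to account for a 4x slowdown over a batch of ~100 volumes.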
Profiling with cProfile. This is for deconvolving 100 volumes with 10 iterations each. The actual deconvolution takes about 0.75 s per frame. The next most expensive item is a pyopencl call (probably related to deskew/rotate) taking nearly 0.4 s per call. np.astype takes up a considerable amount of processing time, as do reading and writing. Some of this could probably be parallelized.
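For reference, the stats below were gathered roughly like this (a sketch; run_batch and its argument are placeholders for the actual batch entry point):

```python
import cProfile
import pstats

# run the whole batch under the profiler and dump the raw stats to disk
cProfile.run("run_batch(experiment_folder)", "process_stats")  # run_batch is a placeholder

# show the hot spots, sorted by internal time as in the listing below
pstats.Stats("process_stats").sort_stats("tottime").print_stats(15)
```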
```
Fri Mar 15 11:57:54 2019 process_stats
2275854 function calls (2263132 primitive calls) in 347.478 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100 75.846 0.758 75.846 0.758 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
101 39.725 0.393 39.725 0.393 {built-in method pyopencl._cl._enqueue_read_buffer}
227 36.042 0.159 36.042 0.159 {built-in method numpy.core.multiarray.concatenate}
922 34.942 0.038 34.942 0.038 {method 'astype' of 'numpy.ndarray' objects}
101 32.304 0.320 32.305 0.320 /home/vhil0002/anaconda3/envs/newllsm/lib/python3.6/site-packages/pyopencl/__init__.py:872(image_init)
10252 20.804 0.002 20.804 0.002 {method 'readinto' of '_io.BufferedReader' objects}
101 20.215 0.200 20.215 0.200 {method 'tofile' of 'numpy.ndarray' objects}
217 19.790 0.091 19.790 0.091 {built-in method numpy.core.multiarray.copyto}
101 13.144 0.130 13.144 0.130 {method 'clip' of 'numpy.ndarray' objects}
101 12.960 0.128 12.960 0.128 {built-in method pyopencl._cl.enqueue_nd_range_kernel}
100 8.832 0.088 343.422 3.434 /home/vhil0002/anaconda3/envs/newllsm/lib/python3.6/site-packages/l
```
I have run similar profiling for the gputools RL deconvolution. There, most of the time is spent in np.astype.
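One general observation: ndarray.astype always allocates and copies, even when the array is already in the target dtype, so redundant conversions can be skipped cheaply. A minimal illustration of the idiom (not code from the pipeline):

```python
import numpy as np

vol = np.random.rand(64, 256, 256).astype(np.float32)

a = vol.astype(np.float32)                                       # always allocates and copies
b = vol if vol.dtype == np.float32 else vol.astype(np.float32)   # convert only when needed
c = vol.astype(np.float32, copy=False)                           # numpy skips the copy if nothing changes
```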
Some more comments about gputools deconvolve (separate from flowdec). There are some obvious improvements that can be made: for example, an FFT plan is calculated in the gputools implementation but never used, and the PSF is pre-processed and sent to the GPU each time the deconvolution is called. Separating the deconvolution into an init and a run step would allow processing the PSF once and leaving it on the GPU, as sketched below.
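A rough shape for that split could look like the following. This is a structural sketch only, not the gputools API: np.fft stands in for the GPU FFT, and in a real implementation the plan, the transformed PSF and the temporary buffers would live on the GPU.

```python
import numpy as np

class RLDeconvolver:
    """Richardson-Lucy deconvolution split into a one-off init and a per-volume run (CPU sketch)."""

    def __init__(self, psf, shape, n_iter=10):
        self.n_iter = n_iter
        self.shape = shape
        # pre-process the PSF once: normalize, pad to volume shape, center at the origin, transform
        psf = psf / psf.sum()
        psf_padded = np.zeros(shape, dtype=np.float32)
        psf_padded[tuple(slice(0, s) for s in psf.shape)] = psf
        psf_padded = np.roll(psf_padded, [-(s // 2) for s in psf.shape], axis=(0, 1, 2))
        self._otf = np.fft.rfftn(psf_padded)
        self._otf_conj = np.conj(self._otf)

    def run(self, volume):
        # only per-volume work happens here; the pre-processed PSF is reused as-is
        estimate = np.full(self.shape, volume.mean(), dtype=np.float32)
        for _ in range(self.n_iter):
            blurred = np.fft.irfftn(np.fft.rfftn(estimate) * self._otf, s=self.shape)
            ratio = volume / np.maximum(blurred, 1e-12)
            estimate *= np.fft.irfftn(np.fft.rfftn(ratio) * self._otf_conj, s=self.shape)
        return estimate
```

With this split, the PSF padding, normalization and forward transform run once per batch, and only run() is called per volume.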
TODO: check whether flowdec sends the PSF to the GPU each time (I believe it does). Maybe that can be optimized as well.
Rewrote maweigert's gputools-based deconvolution to reuse the FFT plan, the pre-processed PSF (which remains in GPU RAM) and the temporary GPU buffers. Removed an unnecessary duplicate .astype(np.complex64).
See https://github.com/VolkerH/Lattice_Lightsheet_Deskew_Deconv/blob/benchmarking/lls_dd/deconv_gputools_rewrite.py
Major speed improvement. The actual deconvolution is now much faster than the overall time per volume, so disk reads/writes will have to happen in separate threads (see the sketch below).
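One way to overlap disk I/O with GPU work is a small read-ahead/write-behind setup. The sketch below assumes placeholder read_volume/write_volume helpers and an init/run-style deconvolver like the one sketched above; it is not the code in the repository.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def process_batch(in_files, out_files, deconvolver, read_volume, write_volume):
    """Prefetch reads and queue writes in worker threads so the GPU is not idle during I/O."""
    results = Queue()

    def writer():
        # drain the result queue and write volumes out until the sentinel arrives
        while True:
            item = results.get()
            if item is None:
                break
            path, data = item
            write_volume(path, data)

    with ThreadPoolExecutor(max_workers=2) as pool:
        writer_done = pool.submit(writer)
        next_vol = pool.submit(read_volume, in_files[0])  # read ahead
        for i, out_path in enumerate(out_files):
            vol = next_vol.result()
            if i + 1 < len(in_files):
                next_vol = pool.submit(read_volume, in_files[i + 1])
            # GPU work happens here while the next read and pending writes run in threads
            results.put((out_path, deconvolver.run(vol)))
        results.put(None)  # tell the writer thread to finish
        writer_done.result()
```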

cProfile stats for the above three runs:
gputools rewrite:
gputools:
flowdec:
