aicsimageio icon indicating copy to clipboard operation
aicsimageio copied to clipboard

CZI tile stitching takes several minutes to complete

Open rcasero opened this issue 4 years ago • 10 comments
trafficstars

System and Software

  • aicsimageio Version: 4.0.3
  • Python Version: 3.7
  • Operating System: Ubuntu linux Focal

Description

Opening a large CZI file (76766 x 131243 = 10.08e9 pixels) with aicsimageio.AICSImage() and then running im.dims takes ~17 min to run. The image lives on a network share and it's called TBX15-H156N-IC-0001-15012021.czi.

By comparison, doing the same thing with a 10x smaller image (35594 x 38884 = 1.38e9 pixels) takes < 1 s. The smaller image is called ___00000009___00001960_6282021-Rreb1-test-0022.czi.

As a workaround, getting the size using im.metadata.findall() doesn't suffer from that problem.

Expected Behavior

As the image is not supposed to load, I expected this to be almost instant.

Reproduction

A minimal example that exhibits the behavior.

AICSImageIO installed with pip install "aicsimageio[czi] @ git+https://github.com/AllenCellModeling/aicsimageio.git".

import time
import aicsimageio
from aicsimageio.readers.czi_reader import CziReader

# open image (this step is almost instantaneous)
histo_file = "TBX15-H156N-IC-0001-15012021.czi"
time0 = time.time()
im = aicsimageio.AICSImage(histo_file, reader=CziReader)
print('time = ' + str(time.time()-time0) + ' s')

# get number of dimensions (this takes ~17 min)
time0 = time.time()
print(im.dims)
print('time = ' + str(time.time()-time0) + ' s')

It prints

time = 0.17210078239440918 s
<Dimensions [T: 1, C: 1, Z: 1, Y: 76766, X: 131243, S: 3]>
time = 1002.111056804657 s

After taking 17 min to load and display the dimensions, repeat calls to im.dims run in ~ 1e-4 s.

Alternatively, calls to

width = int(im.metadata.findall('./Metadata/Information/Image/SizeX')[0].text)
height = int(im.metadata.findall('./Metadata/Information/Image/SizeY')[0].text)

after the first im.dims are as fast.

By contrast, calling im.metadata right from the start instead of im.dims is a lot faster

time0 = time.time()
im = aicsimageio.AICSImage(histo_file, reader=CziReader)
print('time = ' + str(time.time()-time0) + ' s')
time0 = time.time()
width = int(im.metadata.findall('./Metadata/Information/Image/SizeX')[0].text)
height = int(im.metadata.findall('./Metadata/Information/Image/SizeY')[0].text)
print(str(width) + ' x ' + str(height))
print('time = ' + str(time.time()-time0) + ' s')

The output is

time = 0.16332340240478516 s
131243 x 76766
time = 6.256515741348267 s

Environment

The Ubuntu machine has been set up with this script

https://github.com/rcasero/cytometer/blob/fc69c274cd4a83b670f3429769fca66f914d6681/install_dependencies_machine.sh

And the conda local invironment, set up with this script

https://github.com/rcasero/cytometer/blob/fc69c274cd4a83b670f3429769fca66f914d6681/install_dependencies_user.sh

rcasero avatar Jul 06 '21 10:07 rcasero

CZI files available here: https://drive.google.com/drive/folders/1XukPEpCfIvibTT_6992sCgXeszxg-hUK?usp=sharing

File sizes 560M and 2.9G, respectively

rcasero avatar Jul 06 '21 14:07 rcasero

I can't confirm the 17 minutes but I can confirm the incredibly long load times on dims.

A bit of explanation and details: almost everything we do in aicsimageio is dependent on the xarray_dask_data property or, in this case, the mosaic_xarray_dask_data property -- almost all other properties and functions are simply routers / syntactic sugar on top of those objects. What this means is that the image data isn't loaded into memory when either of these are called, but under-the-hood, both of these kick off a lot of processing / transforms.

I added some helpful print statements with datetime timestamps to show how long each step takes in setting up the xarray_dask_data (and mosaic_xarray_dask_data) object.

In [1]: from aicsimageio import AICSImage

In [2]: im = AICSImage("/home/maxfield/Downloads/___00000009___00001960_6282021-Rreb1-test-0022.czi")

In [3]: im.dims
2021-07-06 17:05:33.344785 starting aicsimage dims setup
2021-07-06 17:05:33.344869 starting base reader dims setup
2021-07-06 17:05:33.344886 starting raw xarray dask data setup
2021-07-06 17:05:33.424923 creating dask array
2021-07-06 17:05:34.832689 parsing metadata
2021-07-06 17:05:34.835254 starting mosaic xarray dask data setup
2021-07-06 17:05:34.835292 constructing mosaic array
2021-07-06 17:05:34.911897 stitching tiles
2021-07-06 17:06:08.676094 setting stitched metadata
2021-07-06 17:06:08.676326 done
Out[3]: <Dimensions [T: 1, C: 1, Z: 1, Y: 35594, X: 38884, S: 3]>

In [4]: im2 = AICSImage("/home/maxfield/Downloads/TBX15-H156N-IC-0001-15012021.czi")

In [5]: im2.dims
2021-07-06 17:10:04.170339 starting aicsimage dims setup
2021-07-06 17:10:04.170450 starting base reader dims setup
2021-07-06 17:10:04.170478 starting raw xarray dask data setup
2021-07-06 17:10:04.200160 creating dask array
2021-07-06 17:10:13.120453 parsing metadata
2021-07-06 17:10:13.121936 starting mosaic xarray dask data setup
2021-07-06 17:10:13.121978 constructing mosaic array
2021-07-06 17:10:13.976337 stitching tiles

The second image was taking so long to stitch tiles that I just killed the process.

The code path is sort of laid out in the above time prints statements:

  • you call im.dims
  • AICSImage.dims -> AICSImage.reader.dims
  • AICSImage.reader.dims -> AICSImage.reader.mosaic_xarray_dask_data
  • AICSImage.reader.mosaic_xarray_dask_data -> AICSImage.reader._create_dask_array
  • AICSImage.reader._create_dask_array -> AICSImage.reader._construct_mosaic
  • AICSImage.reader._construct_mosaic -> AICSImage.reader._stitch_tiles
  • take everything we just processed and formalize into a single object and the dims

And this is "required" in a large part. Everything living on the xarray object makes it very easy to know where the data "lives". From both of these, all the metadata handling is neglible, getting / creating the dask array took on the order of ~10 seconds, which is fine, although we could improve speed there, but the stitching tiles process was easily the largest task in the data processing stack.

So all of this together, I think this just means we need to spend some time on trying to speed up the CZIReader tile stitching and maybe also the raw dask array construction.

cc @toloudis @heeler

evamaxfield avatar Jul 06 '21 17:07 evamaxfield

And, I was about to link to our benchmarks that track the time it takes to construct the delayed array but am now seeing that for CZI, it has never worked.... https://allencellmodeling.github.io/aicsimageio/_benchmarks/index.html#benchmark_image_containers.CziReaderSuite.time_delayed_array_construct

Here is a functional "time_delayed_array_construct" for TiffReader: https://allencellmodeling.github.io/aicsimageio/_benchmarks/index.html#benchmark_image_containers.TiffReaderSuite.time_delayed_array_construct

I am first going to make an issue for fixing those benchmarks....

evamaxfield avatar Jul 06 '21 17:07 evamaxfield

Why should calling "dims" cause all tiles to be stitched?

toloudis avatar Jul 07 '21 19:07 toloudis

@rcasero wanted to say that while our tile stitching is slow, you can simply turn off tile stitching with reconstruct_mosaic=False as a parameter. This will return the data with an M dimension. Sorry that this exists in the first place but will try to take a look soon.

evamaxfield avatar Jul 16 '21 00:07 evamaxfield

@JacksonMaxfield In the end, I had to convert to TIFF, and then use openslide, given time constraints. but I'll keep an eye out for new developments! Thanks!

rcasero avatar Jul 30 '21 23:07 rcasero

Hey @rcasero sorry about that! Congrats on the paper though.

evamaxfield avatar Aug 03 '21 20:08 evamaxfield

@JacksonMaxfield Thanks for all your help!

rcasero avatar Aug 09 '21 00:08 rcasero

Looks like @VolkerH is working on some tile stitching with map_blocks: https://github.com/VolkerH/DaskFusion

evamaxfield avatar Sep 03 '21 17:09 evamaxfield

I'll make a quick note here that I keep forgetting to make. I think there should be some kind of separation between stitching grid images and stitching any other data that are irregular.

For CZI files, the tiles are almost always gridded. An example of plugin that specific to gridded tiles is the functionality available in imageJ. pre-calculation of overlap would speed things greatly. The imagej plugin let the user define the expected overlap and some thresholds.

https://imagej.net/plugins/image-stitching

MosGeo avatar Oct 14 '21 09:10 MosGeo

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 29 '23 02:03 github-actions[bot]