aicsimageio
aicsimageio copied to clipboard
CZI tile stitching takes several minutes to complete
System and Software
- aicsimageio Version: 4.0.3
- Python Version: 3.7
- Operating System: Ubuntu linux Focal
Description
Opening a large CZI file (76766 x 131243 = 10.08e9 pixels) with aicsimageio.AICSImage() and then running im.dims takes ~17 min to run.
The image lives on a network share and it's called TBX15-H156N-IC-0001-15012021.czi.
By comparison, doing the same thing with a 10x smaller image (35594 x 38884 = 1.38e9 pixels) takes < 1 s. The smaller image is called ___00000009___00001960_6282021-Rreb1-test-0022.czi.
As a workaround, getting the size using im.metadata.findall() doesn't suffer from that problem.
Expected Behavior
As the image is not supposed to load, I expected this to be almost instant.
Reproduction
A minimal example that exhibits the behavior.
AICSImageIO installed with pip install "aicsimageio[czi] @ git+https://github.com/AllenCellModeling/aicsimageio.git".
import time
import aicsimageio
from aicsimageio.readers.czi_reader import CziReader
# open image (this step is almost instantaneous)
histo_file = "TBX15-H156N-IC-0001-15012021.czi"
time0 = time.time()
im = aicsimageio.AICSImage(histo_file, reader=CziReader)
print('time = ' + str(time.time()-time0) + ' s')
# get number of dimensions (this takes ~17 min)
time0 = time.time()
print(im.dims)
print('time = ' + str(time.time()-time0) + ' s')
It prints
time = 0.17210078239440918 s
<Dimensions [T: 1, C: 1, Z: 1, Y: 76766, X: 131243, S: 3]>
time = 1002.111056804657 s
After taking 17 min to load and display the dimensions, repeat calls to im.dims run in ~ 1e-4 s.
Alternatively, calls to
width = int(im.metadata.findall('./Metadata/Information/Image/SizeX')[0].text)
height = int(im.metadata.findall('./Metadata/Information/Image/SizeY')[0].text)
after the first im.dims are as fast.
By contrast, calling im.metadata right from the start instead of im.dims is a lot faster
time0 = time.time()
im = aicsimageio.AICSImage(histo_file, reader=CziReader)
print('time = ' + str(time.time()-time0) + ' s')
time0 = time.time()
width = int(im.metadata.findall('./Metadata/Information/Image/SizeX')[0].text)
height = int(im.metadata.findall('./Metadata/Information/Image/SizeY')[0].text)
print(str(width) + ' x ' + str(height))
print('time = ' + str(time.time()-time0) + ' s')
The output is
time = 0.16332340240478516 s
131243 x 76766
time = 6.256515741348267 s
Environment
The Ubuntu machine has been set up with this script
https://github.com/rcasero/cytometer/blob/fc69c274cd4a83b670f3429769fca66f914d6681/install_dependencies_machine.sh
And the conda local invironment, set up with this script
https://github.com/rcasero/cytometer/blob/fc69c274cd4a83b670f3429769fca66f914d6681/install_dependencies_user.sh
CZI files available here: https://drive.google.com/drive/folders/1XukPEpCfIvibTT_6992sCgXeszxg-hUK?usp=sharing
File sizes 560M and 2.9G, respectively
I can't confirm the 17 minutes but I can confirm the incredibly long load times on dims.
A bit of explanation and details: almost everything we do in aicsimageio is dependent on the xarray_dask_data property or, in this case, the mosaic_xarray_dask_data property -- almost all other properties and functions are simply routers / syntactic sugar on top of those objects. What this means is that the image data isn't loaded into memory when either of these are called, but under-the-hood, both of these kick off a lot of processing / transforms.
I added some helpful print statements with datetime timestamps to show how long each step takes in setting up the xarray_dask_data (and mosaic_xarray_dask_data) object.
In [1]: from aicsimageio import AICSImage
In [2]: im = AICSImage("/home/maxfield/Downloads/___00000009___00001960_6282021-Rreb1-test-0022.czi")
In [3]: im.dims
2021-07-06 17:05:33.344785 starting aicsimage dims setup
2021-07-06 17:05:33.344869 starting base reader dims setup
2021-07-06 17:05:33.344886 starting raw xarray dask data setup
2021-07-06 17:05:33.424923 creating dask array
2021-07-06 17:05:34.832689 parsing metadata
2021-07-06 17:05:34.835254 starting mosaic xarray dask data setup
2021-07-06 17:05:34.835292 constructing mosaic array
2021-07-06 17:05:34.911897 stitching tiles
2021-07-06 17:06:08.676094 setting stitched metadata
2021-07-06 17:06:08.676326 done
Out[3]: <Dimensions [T: 1, C: 1, Z: 1, Y: 35594, X: 38884, S: 3]>
In [4]: im2 = AICSImage("/home/maxfield/Downloads/TBX15-H156N-IC-0001-15012021.czi")
In [5]: im2.dims
2021-07-06 17:10:04.170339 starting aicsimage dims setup
2021-07-06 17:10:04.170450 starting base reader dims setup
2021-07-06 17:10:04.170478 starting raw xarray dask data setup
2021-07-06 17:10:04.200160 creating dask array
2021-07-06 17:10:13.120453 parsing metadata
2021-07-06 17:10:13.121936 starting mosaic xarray dask data setup
2021-07-06 17:10:13.121978 constructing mosaic array
2021-07-06 17:10:13.976337 stitching tiles
The second image was taking so long to stitch tiles that I just killed the process.
The code path is sort of laid out in the above time prints statements:
- you call
im.dims AICSImage.dims->AICSImage.reader.dimsAICSImage.reader.dims->AICSImage.reader.mosaic_xarray_dask_dataAICSImage.reader.mosaic_xarray_dask_data->AICSImage.reader._create_dask_arrayAICSImage.reader._create_dask_array->AICSImage.reader._construct_mosaicAICSImage.reader._construct_mosaic->AICSImage.reader._stitch_tiles- take everything we just processed and formalize into a single object and the dims
And this is "required" in a large part. Everything living on the xarray object makes it very easy to know where the data "lives". From both of these, all the metadata handling is neglible, getting / creating the dask array took on the order of ~10 seconds, which is fine, although we could improve speed there, but the stitching tiles process was easily the largest task in the data processing stack.
So all of this together, I think this just means we need to spend some time on trying to speed up the CZIReader tile stitching and maybe also the raw dask array construction.
cc @toloudis @heeler
And, I was about to link to our benchmarks that track the time it takes to construct the delayed array but am now seeing that for CZI, it has never worked.... https://allencellmodeling.github.io/aicsimageio/_benchmarks/index.html#benchmark_image_containers.CziReaderSuite.time_delayed_array_construct
Here is a functional "time_delayed_array_construct" for TiffReader: https://allencellmodeling.github.io/aicsimageio/_benchmarks/index.html#benchmark_image_containers.TiffReaderSuite.time_delayed_array_construct
I am first going to make an issue for fixing those benchmarks....
Why should calling "dims" cause all tiles to be stitched?
@rcasero wanted to say that while our tile stitching is slow, you can simply turn off tile stitching with reconstruct_mosaic=False as a parameter. This will return the data with an M dimension. Sorry that this exists in the first place but will try to take a look soon.
@JacksonMaxfield In the end, I had to convert to TIFF, and then use openslide, given time constraints. but I'll keep an eye out for new developments! Thanks!
Hey @rcasero sorry about that! Congrats on the paper though.
@JacksonMaxfield Thanks for all your help!
Looks like @VolkerH is working on some tile stitching with map_blocks: https://github.com/VolkerH/DaskFusion
I'll make a quick note here that I keep forgetting to make. I think there should be some kind of separation between stitching grid images and stitching any other data that are irregular.
For CZI files, the tiles are almost always gridded. An example of plugin that specific to gridded tiles is the functionality available in imageJ. pre-calculation of overlap would speed things greatly. The imagej plugin let the user define the expected overlap and some thresholds.
https://imagej.net/plugins/image-stitching
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.