Two questions about converting larger than memory ND data into ome-zarr

Open dpshepherd opened this issue 3 years ago • 4 comments

Hi all,

Thanks for the hard work on this package and overall on ome-ngff. We are very excited to learn that Dask arrays are now supported!

We have 4D data of shape CZYX, where typically C=17 and dtype=np.uint16. The data is generated by iterative multiplexed light-sheet imaging. The ZYX dimensions are the same for each channel and are usually large (ranging from [256, 50000, 50000] to [1000, 100000, 100000]). The full-resolution data for each channel is stored as a Zarr array on disk, and the channels can be stacked together using Dask.
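For a sense of scale, the uncompressed size implied by those shapes can be worked out directly. This is only back-of-the-envelope arithmetic from the numbers quoted above (17 channels, uint16, the two ZYX extremes); the helper name `dataset_bytes` is made up for illustration.

```python
from math import prod

BYTES_PER_VOXEL = 2  # np.uint16 is 2 bytes per voxel


def dataset_bytes(channels, zyx):
    """Uncompressed size in bytes of a CZYX uint16 array (hypothetical helper)."""
    return channels * prod(zyx) * BYTES_PER_VOXEL


small = dataset_bytes(17, (256, 50_000, 50_000))
large = dataset_bytes(17, (1000, 100_000, 100_000))

# Decimal terabytes (1 TB = 1e12 bytes), before any compression.
print(f"smallest case: {small / 1e12:.1f} TB")
print(f"largest case:  {large / 1e12:.1f} TB")
```

Either way the full-resolution array is far beyond RAM, so every step of the conversion has to work chunk by chunk.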

Two questions regarding converting this data to ome-zarr:

  1. Should we pre-calculate the multiscale data on our own given the large size? Looking through a few issues and PRs, it isn't clear to us if the Scaler() function in ome-zarr-py performs lazy down-sampling.
  2. Is there a concrete example of how to construct the metadata dictionary that contains the channel names and colors for each channel? We've found good examples for the axes and transformations, but were a bit unsure about channels. Sorry if we missed something obvious.

Thanks!

dpshepherd, Feb 25 '23 16:02

write_image() should be able to handle a dask array and perform lazy downsampling, but we (OME) haven't tested it with data of the size you're working with, although others may have.

The Scaler class only has one way of downsampling for dask arrays, which uses resize from https://github.com/ome/ome-zarr-py/blob/master/ome_zarr/dask_utils.py#L11 to downscale and then write the data to disk: https://github.com/ome/ome-zarr-py/blob/2c4d48972bc3456f72c9a7ba0993c887e80a888d/ome_zarr/writer.py#L496

There was some discussion on the logic for that on the PR: https://github.com/ome/ome-zarr-py/pull/192#issuecomment-1103859096

There is a PR currently open to fix a bug with the resizing of the edge tiles in a dask array at https://github.com/ome/ome-zarr-py/pull/244.

There's also an issue raised about this at https://github.com/ome/ome-zarr-py/issues/237.

No, there are no helper methods for constructing the channels metadata. There is just the example at https://ngff.openmicroscopy.org/latest/#omero-md. Apologies for the minimal docs there. The schema (see https://github.com/ome/ngff/blob/ee4d5dab677636a28f1f65c248a751e279a0d1fe/0.4/schemas/image.schema#L97) specifies that only window and color are required. window.min and window.max give the range of pixel values, while window.start and window.end are the rendering settings: the pixel values displayed as black (start) and as saturated (end).
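As a rough illustration of that structure, here is a minimal sketch that builds an "omero" block for 17 channels as a plain dictionary. The helper names (`make_channel`, `build_omero`), the channel labels and the colour cycle are all placeholders invented for this example; only the keys (`channels`, `color`, `window`, `label`, `active`) follow the spec example linked above, with `color` given as a hex RGB string without a leading `#`.

```python
import json


def make_channel(label, color, pixel_max=65535, start=0, end=None):
    """Build one entry of omero['channels'] (hypothetical helper, not part of ome-zarr-py)."""
    return {
        "label": label,      # channel name shown in viewers
        "color": color,      # hex RGB string, e.g. "FF0000"
        "active": True,
        "window": {
            "min": 0,                 # lowest possible pixel value
            "max": pixel_max,         # highest possible pixel value (uint16 here)
            "start": start,           # rendered as black
            "end": pixel_max if end is None else end,  # rendered as saturated
        },
    }


def build_omero(labels, colors):
    """Cycle through `colors` so every label gets one (hypothetical helper)."""
    channels = [
        make_channel(label, colors[i % len(colors)])
        for i, label in enumerate(labels)
    ]
    return {"channels": channels}


# Placeholder names and colours for a 17-channel dataset.
labels = [f"round{i:02d}" for i in range(17)]
omero = build_omero(labels, ["FF0000", "00FF00", "0000FF", "FFFF00"])

print(json.dumps(omero["channels"][0], indent=2))
```

The resulting dictionary would then be stored under the "omero" key of the image group's attributes, alongside "multiscales".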

will-moore, Feb 27 '23 14:02

Coincidentally, I have an immediate need to parameterize the order argument, which we left at order=1, for the dask skimage rescale function: https://github.com/ome/ome-zarr-py/blob/master/ome_zarr/scale.py#L153 Interestingly, for visualization of raw microscopy intensities, order>1 preserves detail well, but for segmentations/labels we need a low order to avoid interpolating between label values.
I'll probably PR something soon on that. It could be interesting to allow providing one's own external Scaler implementation too - I can't remember if that was ever a thing.
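To make the label problem concrete, here is a small NumPy-only sketch. It does not call skimage; 2x2 block averaging simply stands in for any interpolating downsample, and strided sampling stands in for nearest-neighbour. The point is that averaging across a boundary between labels 1 and 3 invents a label 2 that never existed in the segmentation.

```python
import numpy as np

# Toy label image: three objects with IDs 1, 3 and 5.
labels = np.array(
    [
        [1, 1, 1, 3],
        [1, 1, 1, 3],
        [5, 5, 5, 5],
        [5, 5, 5, 5],
    ],
    dtype=np.uint16,
)

# Nearest-neighbour style: keep the top-left pixel of every 2x2 block.
nearest = labels[::2, ::2]

# Interpolation stand-in: mean of every 2x2 block.
h, w = labels.shape
averaged = labels.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

print(nearest)   # only IDs that exist in the input
print(averaged)  # the top-right block mixes 1 and 3 into a spurious 2
```

For intensity images the averaged result is usually the better one, which is why the choice needs to be configurable per image type.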

toloudis, Feb 27 '23 19:02

Hi all,

Thank you both for the info. We are trying with some smaller data first and hit a few technical snags. We'll work on them on our own and come back with more questions.

Thanks again!

dpshepherd, Feb 28 '23 01:02

Hi all,

We ended up writing our own lazy downsampling code for these large datasets, because the project in its current state attempts to load the entire full-resolution array into memory to calculate the downsampled levels.
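The general idea can be sketched as follows. This is not our actual code, just a minimal illustration for a single 2D plane: read one tile at a time, average 2x2 blocks, and write the result straight into the output, so memory use is bounded by the tile size. The function name and the `block` parameter are made up for this example. NumPy arrays are used here for the demo, but anything that supports slice reads and writes (such as zarr arrays) could be passed as `src` and `dst`.

```python
import numpy as np


def downsample_plane_blockwise(src, dst, block=256):
    """Fill `dst` (shape H//2 x W//2) with 2x2 block means of `src`.

    At most `block` x `block` source pixels are held in memory at once.
    `block` must be even so tiles never split a 2x2 neighbourhood.
    A trailing odd row or column of `src` is dropped.
    """
    if block % 2:
        raise ValueError("block must be even")
    out_h, out_w = dst.shape
    for y in range(0, 2 * out_h, block):
        for x in range(0, 2 * out_w, block):
            y1 = min(y + block, 2 * out_h)
            x1 = min(x + block, 2 * out_w)
            # Only this tile is read from the source.
            tile = np.asarray(src[y:y1, x:x1], dtype=np.float32)
            h, w = tile.shape
            small = tile.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
            # Casting back to an integer dtype truncates the fractional part.
            dst[y // 2 : y1 // 2, x // 2 : x1 // 2] = small.astype(dst.dtype)


# Small demo: a 6x6 plane processed in 4x4 tiles.
src = np.arange(36, dtype=np.uint16).reshape(6, 6)
dst = np.zeros((3, 3), dtype=np.uint16)
downsample_plane_blockwise(src, dst, block=4)
print(dst)
```

Repeating this per plane and per channel, and then again on each output level, builds the pyramid without ever materialising a full-resolution array.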

Because we generate the data from our own microscopes and are now doing the downsampling on our own, it makes more sense to re-arrange the existing zarr store and then add the various OME format attributes. Otherwise, we are needlessly copying data between two zarr stores. On that note, addressing issue #258 would help us a lot, because we could then validate.
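For anyone taking the same route, the "multiscales" attribute that has to be added looks roughly like the sketch below. It is written against the 0.4 layout (axes, datasets, one scale transformation per level); the helper name, the voxel sizes, the micrometer unit, the per-level factors and the dataset paths "0", "1", "2" are all assumptions for illustration and need to match how the store is actually laid out.

```python
import json


def build_multiscales(base_scale, per_level_factor, n_levels, name="image"):
    """Build a 0.4-style 'multiscales' attribute for a CZYX pyramid (hypothetical helper).

    base_scale:       physical size of one voxel at level 0, ordered (c, z, y, x)
    per_level_factor: downsampling factor applied per axis at each level
    """
    axes = [
        {"name": "c", "type": "channel"},
        {"name": "z", "type": "space", "unit": "micrometer"},
        {"name": "y", "type": "space", "unit": "micrometer"},
        {"name": "x", "type": "space", "unit": "micrometer"},
    ]
    datasets = []
    for level in range(n_levels):
        scale = [s * f**level for s, f in zip(base_scale, per_level_factor)]
        datasets.append(
            {
                "path": str(level),  # assumed name of the array holding this level
                "coordinateTransformations": [{"type": "scale", "scale": scale}],
            }
        )
    return {
        "multiscales": [
            {"version": "0.4", "name": name, "axes": axes, "datasets": datasets}
        ]
    }


# Assumed example: 1.0 um z-step, 0.5 um pixels, 2x downsampling in Y and X only.
meta = build_multiscales([1.0, 1.0, 0.5, 0.5], (1, 1, 2, 2), n_levels=3)
print(json.dumps(meta["multiscales"][0]["datasets"], indent=2))
```

The dictionary can then be merged into the attributes of the group that contains the level arrays, without copying any pixel data.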

Thanks for the guidance! Once everything is working, I'll try to find a place to host the completed ome-zarr so we can see how well remote viewing of such a large dataset performs.

dpshepherd, Mar 09 '23 15:03