Re-enable compression
Compression was disabled (and still is) because of this issue: https://github.com/ome/ome-zarr-py/issues/219
The new ome-zarr-py version should have this addressed. Check this and re-enable compression.
@LucaMarconato , is this still an issue?
Yes, compression is still not enabled.
@LucaMarconato how about now?
It looks like it is fixed, but we don't expose storage_options anymore.
Just to report: the default compression is currently lz4 with compression level 5. lz4 is optimized for compression/decompression speed and low memory requirements, while zstd typically provides the best balance of compression ratio and speed, so the question is which one we want to make the default. Also, should we expose other options besides these two, for example xz compression?
Thanks for the info. I would keep lz4 as the default and, as discussed in the call, add an explicit argument that mentions this default, with at least two other options: no compression and zstd. This way we avoid exposing storage_options but still provide control over the compression levels.
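For concreteness, here is a minimal sketch of what those three options correspond to at the codec level, assuming zarr v2 and numcodecs (which the current writer relies on); the variable names are only illustrative and nothing here is SpatialData API:

from numcodecs import Blosc

# Roughly the lz4 / level 5 default mentioned above (zarr v2's default Blosc settings).
lz4_default = Blosc(cname="lz4", clevel=5, shuffle=Blosc.SHUFFLE)

# zstd at the same level: usually a better compression ratio for a modest speed cost.
zstd_option = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)

# "No compression" in zarr v2 terms means passing compressor=None when creating the array.
no_compression = None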
Dear all, I am currently diving into the use of spatialdata and am now exactly at the point where I wanted to use compression via storage options. I am not sure if this fits into this issue; if not, let me know and I will ask somewhere else. Here is my scenario: I have standard Visium data (not a problem) which I want to align with several WSIs stored in the .vsi format. This format uses compression (I guess JPEG) to keep the data small. This is debatable, but it keeps the data handy and readable enough for our research. My workflow is the following: I export the data deposited on an OMERO instance via omero zarr export and read it, except for some metadata details, with:
import os
from ome_zarr.io import parse_url
from ome_zarr.reader import Reader
import dask.array as da
import xarray as xr
from spatialdata.models import Image2DModel
from spatialdata.transformations import Scale
from spatial_image import SpatialImage

spatial_data  # already exists from the Visium import
...
reader = Reader(parse_url(zarr_path))
...
# read in the highest resolution
raw = da.from_zarr(f"{zarr_path}/0")
img_array = _slice_to_2d(raw)  # user-defined helper (not shown)
xr_img = xr.DataArray(img_array, dims=("c", "y", "x"))
sp_img = SpatialImage(xr_img)
# use Image2DModel.parse to load the data and generate the pyramidal resolutions with spatialdata
img_model = Image2DModel.parse(
    sp_img,
    transformations={marker: Scale([x_scale, y_scale], ["x", "y"])},
    rgb=None,
    chunks=(img_array.shape[0], 256, 256),
    scale_factors=[2] * number_of_scales,
)
spatial_data.images[marker] = img_model
Now, after adding several images under the name marker, I use spatial_data.write() to generate a portable, self-contained file that has all the information. But this has two main problems: first, the number of inodes on an HPC explodes (I would love to write to .zip stores), and second, the size is now >100 GB per Visium experiment. Even though everything is there, we still cannot really work with these datasets. Maybe we just need bigger computers. What is the recommended workflow?
My first idea was to use something like spatial_data.write(compression=jpeg2000). Or should I write the WSIs somewhere else and not store everything with the high-level .write() function? If so, I would be grateful for (and happy to help create) documentation/a tutorial that helps people organize their data appropriately.
So, a couple of things: we want to have zarr v3 support in SpatialData, which would help you on an HPC with the nested directories, but it is a bit hard to give a timeline for that right now. Zarr v3 supports what is called sharding (for the AnnData part, see here: https://anndata.readthedocs.io/en/latest/tutorials/zarr-v3.html).
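To illustrate why sharding helps with the inode count, here is a minimal sketch using zarr-python >= 3 directly, independent of SpatialData (which does not write zarr v3 yet); the store path, shape, and shard size are made up for illustration:

import numpy as np
import zarr  # requires zarr-python >= 3

# One shard file packs a 16 x 16 grid of (3, 256, 256) chunks,
# so far fewer files/inodes end up on disk than with one file per chunk.
arr = zarr.create_array(
    store="sharded_example.zarr",
    shape=(3, 32768, 32768),
    dtype="uint8",
    chunks=(3, 256, 256),
    shards=(3, 4096, 4096),
)
arr[:, :256, :256] = np.random.randint(0, 255, (3, 256, 256), dtype="uint8")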
For the compression part, zarr allows for different compression options, but I think there were reasons why we initially stopped supporting it. However, I think it would be relatively easy to reimplement, and I could take that on. @LucaMarconato thoughts?
Thanks for your quick response! I highly appreciate these well-maintained repos. I read about zarr v3 as well as the uncertainty in the timeline; that will be a very important feature, but I can work around it for now.
Re the compression part: if it is feasible, I think it would be an important feature as well. Maybe from a data perspective this does not make too much sense, but let me draw a use case. In e.g. pathology, people doing microscopy are usually not well equipped with high-performance local workstations. This feature would allow us to interactively visualize compressed images even on a "normal" computer (e.g. 16 GB RAM, 1 TB SSD); with regard to histology, even lossy compression is, to the human eye, in my opinion still more helpful than downsampled layers. We could then ask spatial questions based on the images and not only on the transcriptomics data. These well-informed annotations could afterwards go back onto a lossless version and feed back into machine learning algorithms and so on. Additionally, recordings that were lossy to begin with would not need to be blown up again; the information is not there anymore anyway.
Actually, another solution to my use case, unrelated to this issue, would be full support for napari-omero in spatialdata as Image2D objects. Currently I can only visualize these images in parallel, but they are not aligned to the spatial omics data.
Well aware of the data sizes, hehe. For visualization, though, have you tried chunked multiscale images in SpatialData? I assume your data is 2D?
In any case, I will look at reimplementing compression now and keep you posted.
Yes, yes and yes :) The chunked data are really great and work well, but then the pyramids also explode if not compressed ;) chunks=(img_array.shape[0], 256, 256), scale_factors=[2] * number_of_scales. Maybe my chunks are not optimal, but for my setup this works smoothly and the RAM problem is solved. It is simply that the >100 GB per SpatialData object on disk makes it difficult to work with.
OK, given that the issue above seems to be fixed, I will check the reimplementation of this; I think I can still report back this week.
Sorry, this is taking a bit longer, but I am working on it. @LucaMarconato I am opting against a "no compression" option because in ome-zarr, from what I see, this would default to zlib. The lz4 default we have right now is the default for Blosc and offers fast compression (albeit with a lower compression ratio) and decompression. So I would say let's just expose these two codecs plus the compression levels.
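For reference, a quick way to compare the two candidate codecs (and no compression) on one's own data; this sketch assumes zarr-python v2 and numcodecs and is not SpatialData-specific. Replace the synthetic array with a representative image tile to get realistic ratios:

import numpy as np
import zarr  # assumes zarr-python v2, where arrays take a single `compressor`
from numcodecs import Blosc

# Synthetic, fairly compressible data; swap in a real image tile for meaningful numbers.
data = np.tile(np.arange(4096, dtype="uint16"), (3, 4096, 1))

for codec in (Blosc(cname="lz4", clevel=5), Blosc(cname="zstd", clevel=5), None):
    z = zarr.array(data, chunks=(3, 256, 256), compressor=codec)  # in-memory store
    print(codec, f"compression ratio: {z.nbytes / z.nbytes_stored:.1f}")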
@J-Franz I opened #944
@melonora thank you for reactivating the compression. I tried it on my files and it worked smoothly, great! Still, my specific problem is not fully solved, as the available lossless compressors are just not strong enough. E.g. switching from the standard compression to level 9 reduces the file size from 62.7 GB to 49.5 GB in one example. That does not solve the problem that larger projects to be annotated require >1 TB of SSD storage to use napari effectively. But I admit that this leads to ome-zarr-related discussions on lossy compression and is maybe better suited to be discussed there. E.g. https://pypi.org/project/zarr-jpeg2k/ should in general open up this possibility, right? For SpatialData this might become a relevant issue, as people like me will tend to incorporate more and more WSIs into a single file, which might reduce the portability of SpatialData objects. For me, only an 8 TB SSD solved the issue.
@J-Franz Yeah, here we indeed get into the realm of a much broader discussion on when lossy compression is acceptable, particularly with deep learning applications, the concept of FAIR, etc.; it is hard to estimate what information you potentially lose. Regarding portability: short term, yes, but this is not just a SpatialData problem. Long term there is the move to zarr v3, which allows for sharding and would increase portability (though not decrease the size).
Right now, as far as I am aware, ome-zarr does not support jpeg2k compression, only the codecs in Blosc, which are all lossless. The only higher-compression option in there is 'lz4hc', but then you really take a performance hit because of the increased decompression time when visualizing.