Save custom `.attrs` in SpatialElements to disk
Hi! Thanks for the great package!
I am currently working with spatialdata to load and parse microscopy files (particularly in the proprietary Carl Zeiss .czi format). For this purpose, it would be extremely useful to persistently store metadata related to the acquisition of the microscopy data (e.g. magnification, resolution, etc.) in the `.attrs` attribute of the SpatialElement. While I can manipulate the elements in memory, it is currently not possible to write this metadata to disk (see the example below).
While it would be possible to write the metadata to a custom table object, it feels much more natural to have this data directly associated with the image.
Therefore, it would be fantastic if SpatialElements supported writing their `.attrs` keys to disk.
Example
```python
import os
import tempfile

import numpy as np
import spatialdata as sd

tmp_dir = tempfile.gettempdir()  # tempfile.tempdir can be None before first use
save_path = os.path.join(tmp_dir, "blobs.zarr")

# Load dataset and set custom metadata
blobs = sd.datasets.blobs()
blobs.images["blobs_image"].attrs["metadata"] = {
    "info1": 1,
    "info2": "2",
    "info3": np.zeros(shape=(3, 3)),
}
sd.models.Image2DModel().validate(blobs.images["blobs_image"])

# Metadata is in memory
assert "metadata" in blobs.images["blobs_image"].attrs

# Writing to disk leads to loss of the information
blobs.write(save_path)
sdata = sd.read_zarr(save_path)
assert "metadata" not in sdata.images["blobs_image"].attrs
```
Hi, thanks for reaching out. In principle we could extend the way in which we write `SpatialData.attrs` to disk (https://github.com/scverse/spatialdata/blob/03dbf80f653961ad6e517a823058978d6d3fa169/src/spatialdata/_io/io_zarr.py#L141) to the various elements as well (shapes, points; for tables, as you observed, the `.uns` is already serialized).
What do you think @giovp @timtreis @melonora?
The only issue that I see is that our APIs currently only take care of forwarding the metadata that we define in the format (for instance, after cropping an element, the information on the coordinate systems is passed to the new cropped object). I am not sure if we should ignore the remaining values of `.attrs` and have the user copy them manually, or if there should be an automatic function. In that case, should we shallow copy or deep copy?
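For what it's worth, here is a minimal sketch (plain Python, independent of spatialdata) of how the two choices would behave when `.attrs` holds mutable values such as numpy arrays:

```python
import copy

import numpy as np

attrs = {"metadata": {"info3": np.zeros((3, 3))}}

shallow = copy.copy(attrs)   # top-level dict is new, but nested values are shared
deep = copy.deepcopy(attrs)  # nested dict and array are fully independent

attrs["metadata"]["info3"][0, 0] = 1.0
assert shallow["metadata"]["info3"][0, 0] == 1.0  # mutation leaks into the shallow copy
assert deep["metadata"]["info3"][0, 0] == 0.0     # deep copy is unaffected
```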
Also, it is a good idea to explore what libraries like pandas, xarray and dask do. In this regard, the situation is a bit unfortunate:
- `pandas` doesn't have fully specified/clear semantics: https://github.com/pandas-dev/pandas/issues/52166. Also relevant: https://github.com/pandas-dev/pandas/pull/60357.
- for `dask`, the situation is more unfortunate. The new `dask-expr` backend for `dask` dataframes removed `.attrs`, a change that is currently giving us headaches 😅 https://github.com/dask/dask/issues/11146
- for `xarray` we haven't had problems so far, but one would need to look deeper into the code.
Hey @lucas-diedrich, if I'm not mistaken, SpatialData has a persistent `.attrs` at the top level of the object, i.e. not directly associated with the individual elements (see https://github.com/scverse/spatialdata/pull/711). Until we make a decision in this discussion, you could probably convert your "metadata" slot into a dict whose keys are the elements you're annotating and whose values are your info, as sketched below.
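Building on the example above, the workaround could look roughly like this (`element_metadata` is just an illustrative key, not a spatialdata convention; `blobs` and `tmp_dir` come from the example above):

```python
# Hypothetical workaround: keep per-element metadata in the persistent,
# top-level SpatialData.attrs, keyed by the name of the annotated element.
blobs.attrs["element_metadata"] = {
    "blobs_image": {"info1": 1, "info2": "2"},
}
blobs.write(os.path.join(tmp_dir, "blobs_with_attrs.zarr"))
```

Note that a numpy array value would still run into the same serialization question discussed below.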
Hi Tim, hi Luca, thanks for being so responsive and for your input!
Thanks, that's a great suggestion, I'll do that in the meantime.
so, I'm trying to get back into the dev discussions of the ecosystem and this is a first attempt
I think that if we support the serialization of attrs the way uns does it, we might encounter some challenges in deciding what we serialize. `adata.uns`, for example, supported some serialization, but issues were raised over the years because people started to want more support, or encountered bugs. Even for the example above,
blobs.images["blobs_image"].attrs["metadata"] = {
"info1": 1,
"info2": "2",
"info3": np.zeros(shape=(3, 3))
}
a humble array already raises a bunch of options for how it should be serialized...
so I guess my position is not against it in principle, but we need to understand what we are getting into...
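To make the options concrete, here is a sketch (plain json/numpy, not anything spatialdata currently does) of two common ways to put an array into JSON, each with different trade-offs:

```python
import base64
import json

import numpy as np

arr = np.zeros((3, 3))

# Option 1: nested lists -- human-readable JSON, but dtype and endianness are lost
as_lists = json.dumps(arr.tolist())

# Option 2: raw bytes as base64, plus explicit dtype/shape -- lossless, but opaque
as_bytes = json.dumps({
    "data": base64.b64encode(arr.tobytes()).decode("ascii"),
    "dtype": str(arr.dtype),
    "shape": arr.shape,
})

# Round-trip for option 2
meta = json.loads(as_bytes)
restored = np.frombuffer(
    base64.b64decode(meta["data"]), dtype=meta["dtype"]
).reshape(meta["shape"])
assert np.array_equal(arr, restored)
```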
@giovp good point! Maybe we could overcome problems like this by requiring `.attrs` to be serializable from and to JSON. pandas requires this (it's a bit of a hidden requirement that I found at some point while navigating the pandas issues; I can't find the exact reference for it). If we require this, we could allow users to store any object, as long as there is a JSON representation that we can construct during IO. To achieve this, we could subclass JSONEncoder, or maybe require the implementation of some methods that we can discover, such as to_json() and from_json() (more on this [here](https://stackoverflow.com/questions/3768895/how-to-make-a-class-json-serializable)).
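A minimal sketch of what the JSONEncoder route could look like (`AttrsJSONEncoder` and `attrs_object_hook` are hypothetical names, not an existing spatialdata API):

```python
import json

import numpy as np


class AttrsJSONEncoder(json.JSONEncoder):
    """Tag numpy arrays with a sentinel key so they can be round-tripped."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {"__ndarray__": obj.tolist(), "dtype": str(obj.dtype)}
        return super().default(obj)


def attrs_object_hook(d):
    """Inverse of the encoder: rebuild tagged arrays during json.loads."""
    if "__ndarray__" in d:
        return np.asarray(d["__ndarray__"], dtype=d["dtype"])
    return d


attrs = {"info1": 1, "info2": "2", "info3": np.zeros((3, 3))}
payload = json.dumps(attrs, cls=AttrsJSONEncoder)
restored = json.loads(payload, object_hook=attrs_object_hook)
assert np.array_equal(restored["info3"], attrs["info3"])
```

The to_json()/from_json() route would generalize this to arbitrary user classes, at the cost of defining a discovery convention.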
Alternatively, for the time being, we could endorse the workaround suggested by @timtreis and choose not to implement this feature. The rationale would be that the serialization of `.attrs` does not actually seem to be implemented in pandas either: https://github.com/pandas-dev/pandas/issues/20521 https://github.com/pandas-dev/pandas/issues/53577 https://github.com/pandas-dev/pandas/issues/51012 https://github.com/pandas-dev/pandas/issues/34596.
I found it, this is the requirement for .attrs in pandas to be serializable to JSON: https://github.com/pandas-dev/pandas/issues/51012#issuecomment-1743856513.
Maybe for context: at least for my specific use case, the metadata is indeed expected to be JSON-serializable. For your reference, I attached some example data from two different parsers (Carl Zeiss pylibCZIrw for .czi files, OpenSlide for .mirax data).