
Organize segmentations, tracking, annotations, and embeddings with DynaCell zarr stores

Open mattersoflight opened this issue 4 months ago • 10 comments

We would like to organize the DynaCell dataset such that analysis results and annotations are carried inside the same .zarr folder. We arrived at the following structure: fov.zarr

  • image/
  • labels/
  • tracks/
  • imaging_metadata.json

The image and labels arrays will follow the OME-Zarr spec, and imaging_metadata.json will follow the schema being developed by the imaging data working group. For tracks, we are considering the new geff format.
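
A minimal sketch of creating this layout with zarr-python (the imaging_metadata.json content is a placeholder; the image pyramid itself would be written with an OME-Zarr writer such as iohub rather than raw zarr calls):

import json
from pathlib import Path

import zarr

# Create the fov.zarr skeleton: image/, labels/, and tracks/ subgroups.
fov = zarr.open_group("fov.zarr", mode="a")
fov.require_group("image")
fov.require_group("labels")
fov.require_group("tracks")

# Placeholder metadata following the (still draft) imaging data working group schema.
Path("fov.zarr/imaging_metadata.json").write_text(
    json.dumps({"schema": "imaging-data-wg-draft"}, indent=2)
)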

The above decisions make sense, but how should we store and link point annotations?

The point annotations are linked to specific instances of cells (imagine a point at the centroid of the cell) and include human labels of cell state, embeddings, and classification labels produced by models.

Linking embeddings, class labels, or human annotations via track id or label id runs the risk of annotations going out of sync when the segmentation and tracking are improved and updated. Is there a community standard format for storing feature vectors for a point?
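
One possibility (not a settled community standard) is to keep the point annotations in an AnnData table keyed by coordinates, so they can be re-matched against updated segmentations and tracks instead of relying only on label or track ids. A hypothetical sketch, with made-up column names:

import anndata as ad
import numpy as np
import pandas as pd

n_points, emb_dim = 3, 8
adata = ad.AnnData(
    X=np.random.rand(n_points, emb_dim).astype("float32"),  # model embeddings, one row per point
    obs=pd.DataFrame(
        {
            "t": [0, 0, 1],                                        # frame index
            "track_id": [5, 7, 5],                                 # soft link; may go stale after re-tracking
            "label_infection_state": ["none", "infected", "none"], # human annotation
        }
    ),
)
adata.obsm["spatial"] = np.array([[12.0, 40.5], [88.1, 3.2], [13.0, 41.0]])  # (y, x) centroids
adata.write_zarr("fov.zarr/tables/point_annotations")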

mattersoflight avatar Sep 09 '25 23:09 mattersoflight

The suggestion above resembles the SpatialData format. Let's evaluate whether it is flexible and performant for our model training; see the sketch below the links.

  • https://spatialdata.scverse.org/en/stable/tutorials/notebooks/notebooks/examples/intro.html
  • https://liveimagetrackingtools.org/geff/latest/specification/#extra-attributes
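
A rough sketch of what the same FOV could look like as a SpatialData object (illustrative names and shapes; API details may differ across versions):

import numpy as np
from spatialdata import SpatialData
from spatialdata.models import Image2DModel, Labels2DModel, PointsModel

image = Image2DModel.parse(np.zeros((2, 256, 256), dtype="uint16"))   # (c, y, x)
labels = Labels2DModel.parse(np.zeros((256, 256), dtype="uint32"))    # (y, x) instance labels
points = PointsModel.parse(np.array([[40.5, 12.0]]))                  # (x, y) centroids
sdata = SpatialData(images={"fov": image}, labels={"nuclei": labels}, points={"centroids": points})
sdata.write("fov.sdata.zarr")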

edyoshikun avatar Sep 15 '25 15:09 edyoshikun

@edyoshikun

SpatialData is a pretty good format. One of the issues is that it only supports CZYX or CYX layouts for the Image (FOV) as of right now.

srivarra avatar Sep 15 '25 16:09 srivarra

Also, from our AI@MBL projects phase, it seems SpatialData adds a bit of unnecessary overhead: there is a set of transforms that need to be applied or parsed so that all elements share the same frame of reference.

This is one example project https://github.com/afoix/vaery-unsupervised/tree/km_training/vaery_unsupervised
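
For context, the transform bookkeeping being referred to looks roughly like this (a sketch with illustrative values): every element carries transformations to named coordinate systems, and consumers must resolve them before elements line up.

import numpy as np
from spatialdata.models import Image2DModel
from spatialdata.transformations import Scale, get_transformation

image = Image2DModel.parse(
    np.zeros((1, 64, 64), dtype="uint16"),
    transformations={"global": Scale([0.5, 0.5], axes=("y", "x"))},
)
print(get_transformation(image, to_coordinate_system="global"))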

edyoshikun avatar Sep 15 '25 16:09 edyoshikun

@edyoshikun There also isn't really a good way to convert an OME-Zarr to a SpatialData object as of now.

But I think Luca Marconato is working on the foundations of that in ome-zarr-models-py, see: #repo-management > Core dev activity summary @ 💬. This is to align SpatialData with OME-Zarr.

srivarra avatar Sep 15 '25 17:09 srivarra

Related: https://github.com/ome/ngff/pull/64 and https://github.com/fractal-analytics-platform/fractal-tasks-core/pull/582.

ziw-liu avatar Sep 15 '25 17:09 ziw-liu

fractal-analytics-platform/fractal-tasks-core#582.

Fractal's table system seems nice:

image.zarr        # Zarr group for a NGFF image
|
├── 0             # Zarr array for multiscale level 0
├── ...
├── N             # Zarr array for multiscale level N
|
├── labels        # Zarr subgroup with a list of labels associated to this image
|   ├── label_A   # Zarr subgroup for a given label
|   ├── label_B   # Zarr subgroup for a given label
|   └── ...
|
└── tables        # Zarr subgroup with a list of tables associated to this image
    ├── table_1   # Zarr subgroup for a given table
    ├── table_2   # Zarr subgroup for a given table
    └── ...
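
Roughly, each table is an AnnData object written inside the tables/ subgroup and registered in the group attributes. A sketch (the exact attributes required by the Fractal table spec may differ):

import anndata as ad
import numpy as np
import zarr

# A per-label feature table: one row per segmented object in labels/label_A.
features = ad.AnnData(X=np.random.rand(10, 4).astype("float32"))
features.obs["label"] = np.arange(1, 11)                 # link back to label ids
features.write_zarr("image.zarr/tables/table_1")

# Make the table discoverable from the tables group metadata.
tables = zarr.open_group("image.zarr/tables", mode="a")
tables.attrs["tables"] = ["table_1"]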

Maybe we can incorporate it into something like this:

├── 123.zarr                  # One OME-Zarr image (id=123).
│   ...
│
└── 456.zarr                  # Another OME-Zarr image (id=456).
    │
    ├── zarr.json             # Each image is a Zarr group of other groups and arrays.
    │                         # Group level attributes are stored in the zarr.json file and include
    │                         # "multiscales" and "omero" (see below).
    │
    ├── 0                     # Each multiscale level is stored as a separate Zarr array,
    │   ...                   # which is a folder containing chunk files which compose the array.
    ├── n                     # The name of the array is arbitrary with the ordering defined by
    │   │                     # the "multiscales" metadata, but is often a sequence starting at 0.
    │   │
    │   ├── zarr.json         # All image arrays must be up to 5-dimensional
    │   │                     # with the axis of type time before type channel, before spatial axes.
    │   │
    │   └─ ...                # Chunks are stored conforming to the Zarr array specification and 
    │                         # metadata as specified in the array's zarr.json.
    │
    ├── labels
    │   │
    │   ├── zarr.json         # The labels group is a container which holds a list of labels to make the objects easily discoverable
    │   │                     # All labels will be listed in zarr.json e.g. { "labels": [ "original/0" ] }
    │   │                     # Each dimension of the label should be either the same as the
    │   │                     # corresponding dimension of the image, or 1 if that dimension of the label
    │   │                     # is irrelevant.
    │   │
    │   └── original          # Intermediate folders are permitted but not necessary and currently contain no extra metadata.
    │       │
    │       └── 0             # Multiscale, labeled image. The name is unimportant but is registered in the "labels" group above.
    │           ├── zarr.json # Zarr Group which is both a multiscaled image as well as a labeled image.
    │           │             # Metadata of the related image and as well as display information under the "image-label" key.
    │           │
    │           ├── 0         # Each multiscale level is stored as a separate Zarr array, as above, but only integer values
    │           └── ...       # are supported.
    │
    ├── tables                # Tables (optional)
    │   ├── table_1          # Zarr subgroup for a given table
    │   └── table_2          # Zarr subgroup for a given table
    │
    └── tracks                # Tracks (optional) - following Geff specification
        └── tracking_graph_1   # Zarr group containing the graph data
            ├── zarr.json     # Contains geff metadata including version, directed flag, axes info
            ├── nodes         # Nodes group
            │   ├── ids       # 1D array of node IDs
            │   └── props     # Optional node properties group
            │       ├── t     # Time coordinate property
            │       │   └── values
            │       ├── x     # X coordinate property
            │       │   └── values
            │       ├── y     # Y coordinate property
            │       │   └── values
            │       └── z     # Z coordinate property (optional for 3D)
            │           └── values
            └── edges         # Edges group
                ├── ids       # 2D array shape (E, 2) for edge connections
                └── props     # Optional edge properties group
                    └── distance  # Example edge property
                        └── values
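
A rough sketch of writing a minimal geff-style tracks group with plain zarr, mirroring the tree above (the geff spec and library may differ in details):

import numpy as np
import zarr

graph = zarr.open_group("456.zarr/tracks/tracking_graph_1", mode="a")
graph.attrs["geff"] = {"geff_version": "0.1", "directed": True}     # placeholder metadata

nodes = graph.require_group("nodes")
nodes.create_dataset("ids", data=np.array([0, 1, 2], dtype="uint64"))
props = nodes.require_group("props")
for name, values in {"t": [0, 1, 1], "y": [12.0, 13.5, 80.0], "x": [40.5, 41.0, 3.2]}.items():
    props.require_group(name).create_dataset("values", data=np.asarray(values))

edges = graph.require_group("edges")
edges.create_dataset("ids", data=np.array([[0, 1], [0, 2]], dtype="uint64"))  # shape (E, 2)
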
And for a whole HCS plate:
.
│
└── 5966.zarr                 # One OME-Zarr plate (id=5966)
    ├── zarr.json             # Implements "plate" specification
    ├── A                     # First row of the plate
    │   ├── zarr.json
    │   │
    │   ├── 1                 # First column of row A
    │   │   ├── zarr.json     # Implements "well" specification
    │   │   │
    │   │   ├── 0             # First field of view of well A1
    │   │   │   │
    │   │   │   ├── zarr.json # Implements "multiscales", "omero"
    │   │   │   ├── 0         # Resolution levels          
    │   │   │   ├── ...
    │   │   │   ├── labels/    # Labels (optional)
    │   │   │   ├── tables/    # Tables AnnData (optional)
    │   │   │   └── tracks/    # Tracks w/ Geff (optional)
    │   │   └── ...           # Other fields of view
    │   └── ...               # Other columns
    └── ...                   # Other rows

srivarra avatar Sep 15 '25 17:09 srivarra

As a first step for the annotations, we will:

  • @srivarra to check EmbeddingWriter() and load_annotation() for writing the Ultrack tracking data and the features.
  • Standardize the input set of strings (e.g. fov_name, label_infection_state, label_division_state).
  • Standardize non-mutable category codes with dicts (sketched after this list).
  • Remove the leading slash from the fov_name.
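
A hypothetical illustration of the last three items (names and codes are placeholders):

# Non-mutable category codes kept in plain dicts.
INFECTION_STATE = {"none": 0, "infected": 1, "unknown": -1}
DIVISION_STATE = {"interphase": 0, "dividing": 1, "unknown": -1}

def normalize_fov_name(fov_name: str) -> str:
    """Drop the leading slash so '/A/1/0' and 'A/1/0' refer to the same FOV."""
    return fov_name.lstrip("/")

assert normalize_fov_name("/A/1/0") == "A/1/0"
assert INFECTION_STATE["infected"] == 1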

edyoshikun avatar Sep 16 '25 19:09 edyoshikun

In https://github.com/mehta-lab/VisCy/pull/274 we also identified that this preprocessing file should be appended to the tracking.csv used during training.

The statistics will vary depending on the patch_sizes.
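
A hypothetical sketch of appending the per-FOV preprocessing statistics to the tracking CSV (column and file names are made up):

import pandas as pd

tracks = pd.read_csv("tracking.csv")
stats = pd.read_csv("preprocessing_stats.csv")        # e.g. per-FOV normalization statistics
tracks = tracks.merge(stats, on="fov_name", how="left")
tracks.to_csv("tracking_with_stats.csv", index=False)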

edyoshikun avatar Sep 16 '25 21:09 edyoshikun

There's also TreeData, which is a thin wrapper around AnnData; it's designed to be useful for tracking/lineage work.

srivarra avatar Sep 22 '25 20:09 srivarra

icechunk might be of interest. It's a transactional storage engine for Zarr. Quoting directly from the docs: "Icechunk data are safe to read and write in parallel from multiple uncoordinated processes. This allows Zarr to be used more like a database". See their blog post. Maybe this can help with the concurrency?
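
A minimal sketch of how icechunk sessions could wrap the Zarr writes, assuming its Repository/session API (consult the icechunk docs for the exact calls):

import icechunk
import zarr

storage = icechunk.local_filesystem_storage("fov_repo")
repo = icechunk.Repository.open_or_create(storage)

session = repo.writable_session("main")
root = zarr.open_group(store=session.store, mode="a")
root.require_group("labels")
session.commit("add labels group")        # each change lands as an atomic commit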

From Sep 29 2025 Meeting with @giovp:

srivarra avatar Sep 29 '25 23:09 srivarra