Use Case: Describe a collection of highly related files

Open multimeric opened this issue 5 months ago • 1 comments

As a researcher, I want to be able to describe a set of related files so that the metadata file does not contain redundant descriptions.

Use Case

Here is a simple example dataset from a MERSCOPE microscope:

$ ls -1 region_R1/images
manifest.json
micron_to_mosaic_pixel_transform.csv
mosaic_DAPI_z0.tif
mosaic_DAPI_z1.tif
mosaic_DAPI_z2.tif
mosaic_DAPI_z3.tif
mosaic_DAPI_z4.tif
mosaic_DAPI_z5.tif
mosaic_DAPI_z6.tif
mosaic_PolyT_z0.tif
mosaic_PolyT_z1.tif
mosaic_PolyT_z2.tif
mosaic_PolyT_z3.tif
mosaic_PolyT_z4.tif
mosaic_PolyT_z5.tif
mosaic_PolyT_z6.tif

According to the user guide:

The images are single channel, single plane, 16-bit grayscale tiff files, with the naming convention mosaic_{stain name}_z{ZIndex}.tif
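As a minimal sketch of how that naming convention could be read by a script, the following Python parses a filename into its stain name and Z index. The regular expression and the helper name `parse_mosaic_name` are assumptions made for illustration; they are not taken from the user guide.

```python
import re

# Assumed regex for the quoted convention: mosaic_{stain name}_z{ZIndex}.tif
MOSAIC_NAME = re.compile(r"^mosaic_(?P<stain>.+)_z(?P<z_index>\d+)\.tif$")


def parse_mosaic_name(filename):
    """Return (stain, z_index) for a mosaic image name, or None for other files."""
    match = MOSAIC_NAME.match(filename)
    if match is None:
        return None
    return match.group("stain"), int(match.group("z_index"))


# Files such as manifest.json do not follow the convention and yield None.
parsed = [parse_mosaic_name(n) for n in ("mosaic_DAPI_z0.tif", "manifest.json")]
```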

Now, I could describe every file individually, but that would produce 14 (in real life, many more) almost identical entities:

[
    {
        "@id": "mosaic_DAPI_z0.tif",
        "@type": "File",
        "encodingFormat": "image/tiff",
        "description": "Mosaic tiff capturing the 0th Z-slice for the DAPI stain."
    },
    {
        "@id": "mosaic_DAPI_z1.tif",
        "@type": "File",
        "encodingFormat": "image/tiff",
        "description": "Mosaic tiff capturing the 1st Z-slice for the DAPI stain."
    },
    ...
]

I also don't like the idea of describing these only as part of the description of the parent Dataset, because then I miss all of the image-specific properties, I lose the ability to run queries like "find all TIFF files", and the Dataset description would become exceedingly long.
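To make the "find all TIFF files" query concrete, here is a rough sketch of what I mean, assuming the metadata has already been loaded into a flat list of entities and that `encodingFormat` is a plain string. The function name and the example graph (including the Dataset description) are made up for illustration.

```python
def find_by_encoding_format(graph, encoding_format):
    """Return the @id of every entity whose encodingFormat equals encoding_format."""
    return [
        entity["@id"]
        for entity in graph
        if entity.get("encodingFormat") == encoding_format
    ]


# Illustrative graph: two per-file entities plus the parent Dataset.
graph = [
    {"@id": "./", "@type": "Dataset", "description": "MERSCOPE region R1 images."},
    {"@id": "mosaic_DAPI_z0.tif", "@type": "File", "encodingFormat": "image/tiff"},
    {"@id": "mosaic_DAPI_z1.tif", "@type": "File", "encodingFormat": "image/tiff"},
]

tiff_ids = find_by_encoding_format(graph, "image/tiff")
```

This only works when each file has its own entity. If the files are mentioned only in the Dataset description, the query has nothing to match against.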

Suggestion

One suggestion I have is to allow us to use glob-style patterns to describe sets of files.

One way this might work is simply by allowing an ID which is a glob. For example:

    {
        "@id": "mosaic_DAPI_z*.tif",
        "@type": "File",
        "encodingFormat": "image/tiff",
        "description": "Mosaic tiff capturing a single Z-slice for the DAPI stain."
    }

The only downside of this is that * is an unusual character in an ID, but it is technically legal in an IRI according to RFC 3987.
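A consumer could expand such a glob ID back into per-file entities. The sketch below is one possible interpretation, not existing RO-Crate behaviour: it assumes shell-style matching as implemented by Python's `fnmatch` module (where `*` also matches `/`), and the function name `expand_glob_entities` is hypothetical.

```python
from fnmatch import fnmatchcase

# Characters that mark an @id as a glob rather than a literal path (assumption).
GLOB_CHARS = set("*?[")


def expand_glob_entities(graph, filenames):
    """Give each filename the properties of the first glob-ID entity matching it."""
    expanded = []
    for name in filenames:
        for entity in graph:
            pattern = entity["@id"]
            if GLOB_CHARS & set(pattern) and fnmatchcase(name, pattern):
                # Copy the shared properties, replacing the glob with the real name.
                expanded.append({**entity, "@id": name})
                break
    return expanded


graph = [
    {"@id": "mosaic_DAPI_z*.tif", "@type": "File", "encodingFormat": "image/tiff"},
]
filenames = ["manifest.json", "mosaic_DAPI_z0.tif", "mosaic_PolyT_z0.tif"]
expanded = expand_glob_entities(graph, filenames)
```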

Alternatively, we could create a new property called pattern (I'm sure we could find an IRI for it that corresponds to practical usage) whose value is a glob pattern selecting a set of files. Attaching it to a Dataset would capture a subset of files, and we would then assume that every property on that Dataset describes each file the pattern matches. For example:

    {
        "@id": "#mosaic-dapi",
        "@type": "Dataset",
        "encodingFormat": "image/tiff",
        "pattern": "mosaic_DAPI_z*.tif",
        "description": "Mosaic tiff capturing a single Z-slice for the DAPI stain."
    }

I like this less, because it's a bit odd and ugly to attach File properties to a Dataset.
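For comparison, here is a sketch of how a tool might interpret the pattern variant. The `pattern` property does not exist in RO-Crate today, and both the inheritance rule (every property except `@id`, `@type` and `pattern` is copied to each matching file) and the function name `files_described_by` are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Properties that belong to the Dataset itself and are not inherited (assumption).
NOT_INHERITED = ("@id", "@type", "pattern")


def files_described_by(dataset, filenames):
    """Synthesise File entities for names matching the dataset's `pattern`."""
    inherited = {k: v for k, v in dataset.items() if k not in NOT_INHERITED}
    return [
        {"@id": name, "@type": "File", **inherited}
        for name in filenames
        if fnmatchcase(name, dataset["pattern"])
    ]


dataset = {
    "@id": "#mosaic-dapi",
    "@type": "Dataset",
    "encodingFormat": "image/tiff",
    "pattern": "mosaic_DAPI_z*.tif",
}
files = files_described_by(dataset, ["mosaic_DAPI_z0.tif", "mosaic_PolyT_z0.tif"])
```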

multimeric, May 29 '25 06:05