steinbock
Feature request: saving images in other file formats
This is more to track a discussion:
Would you consider supporting the storage of images in other file formats, e.g. .h5?
I'm just asking since cytomapper could directly link to those without reading images into memory.
Good question! Steinbock uses good ol' TIFF to ensure interoperability with other tools, especially image viewers. But I agree with you that it's definitely not the optimal format (although steinbock can and does memory-map TIFF files in some use cases). Ideally, formats like OME-NGFF would already be available, but we'll have to wait a bit for that.
Regarding the HDF5 container format I'm a bit hesitant, as there is no "standard" for how to encode images in such a file, afaik. It seems that every tool (e.g. Ilastik, CellProfiler) encodes images in their own way, sometimes partially compatible with each other. Maybe you have more insight/suggestions there?
Yes, TIFF is definitely the preferred format for interoperability. And I fully agree on HDF5 - there's no recommended way of storing images. However, Python and R have readers available that are quite flexible, which makes this file format relatively user friendly. So you can decide on the format and document the specifications; we'd just want to make sure that one can access the data programmatically. I'll have a look at how easy it is to access the ilastik image crops - maybe you could stick to this. I'll get back to you.
Ok, I checked the crops and they look fine to me. The dataset name of each file is crop so they are easy to read in with cytomapper. I would propose to add a flag something like --filetype hdf5 (default --filetype tiff) to each call that generates images (multi- and single-channel) and store the dataset under a default name (e.g. img or mask). But of course only if you think that's reasonable ;) and also no rush with this.
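To make the proposal above concrete, here is a minimal sketch of what "one image per HDF5 file, stored under a default dataset name" could look like with h5py. This is not steinbock's implementation: the function names, the dataset name img, the channels-first layout and the per-channel chunking are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np


def write_image_hdf5(path, img, dataset_name="img"):
    """Store one image array as a single dataset in its own HDF5 file.

    The dataset name "img" is only the default suggested in this thread.
    """
    with h5py.File(str(path), "w") as f:
        # Assumption: channels-first layout; one chunk per channel so that a
        # single channel can be read without touching the others.
        f.create_dataset(dataset_name, data=img, chunks=(1, *img.shape[1:]))


def read_channel_hdf5(path, channel, dataset_name="img"):
    """Read a single channel; only the requested slice is loaded from disk."""
    with h5py.File(str(path), "r") as f:
        return f[dataset_name][channel]


# Hypothetical 3-channel, 64x64 image
img = np.arange(3 * 64 * 64, dtype=np.float32).reshape(3, 64, 64)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.h5"
    write_image_hdf5(path, img)
    channel = read_channel_hdf5(path, channel=1)
```

As long as the dataset name is fixed and documented, a reader on the R side only needs the file path and that name to access the data on disk.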
Yes, TIFF is definitely the preferred format for interoperability. And I fully agree on HDF5 - there's no recommended way of storing images. However, Python and R have readers available that are quite flexible, which makes this file format relatively user friendly.
It seems somewhat counter-intuitive to me to use a container format on top of a directory structure. When storing image data in a container file format, I'd find it more intuitive to store related data (images, masks, crops, single-cell data, ...) in a single file, which btw is also what @mezwick suggested at some point. But I do get the point of enabling other software like cytomapper to more easily do on-disk operations, so I'll add support for "HDF5 images" to steinbock in an upcoming release.
Ok, I checked the crops and they look fine to me. The dataset name of each file is crop so they are easy to read in with cytomapper.
Indeed, the HDF5 image data produced by steinbock for Ilastik is undocumented (on purpose), see https://bodenmillergroup.github.io/steinbock/latest/cli/classification/#data-preparation. You can have a look at https://github.com/BodenmillerGroup/steinbock/blob/548ccf69e80a1ec3cb144a16ec67070fcab5474c/steinbock/classification/ilastik/_ilastik.py#L76-L95 to see how these are generated.
I would propose to add a flag something like --filetype hdf5 (default --filetype tiff) to each call that generates images (multi- and single-channel) and store the dataset under a default name (e.g. img or mask).
Would you suggest to use a command-line option (i.e., can be specified individually for each command) or an environment variable (i.e., we'd expect to always write images in the same file format)?
It seems somewhat counter-intuitive to me to use a container format on top of a directory structure. When storing image data in a container file format, I'd find it more intuitive to store related data (images, masks, crops, single-cell data, ...) in a single file, which btw is also what @mezwick suggested at some point. But I do get the point of enabling other software like cytomapper to more easily do on-disk operations, so I'll add support for "HDF5 images" to steinbock in an upcoming release.
Yes, indeed, storing all data in a single file is the more elegant approach. And cytomapper wouldn't have a problem with it as long as the dataset names are specified. But here I would stick to the general steinbock approach where each function call produces new files.
Would you suggest to use a command-line option (i.e., can be specified individually for each command) or an environment variable (i.e., we'd expect to always write images in the same file format)?
I would use the command-line option as users (me ;)) might not want to store all images in a certain format. Maybe adding it to the "exports" would also be an option. So keep TIFF as the main image format but allow HDF5 export.
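For reference, the two options discussed here don't have to be mutually exclusive. Below is a rough sketch, using only the Python standard library, of a per-command option that falls back to an environment variable and then to TIFF. The --filetype flag is the one proposed in this thread, not an existing steinbock option, and the variable name EXAMPLE_FILETYPE and the function resolve_filetype are made up for illustration.

```python
import argparse
import os

FILETYPES = ("tiff", "hdf5")


def resolve_filetype(argv, env=None):
    """Pick the image file format for one command invocation.

    Precedence: --filetype on the command line, then the (hypothetical)
    EXAMPLE_FILETYPE environment variable, then the TIFF default.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--filetype", choices=FILETYPES, default=None)
    args = parser.parse_args(argv)
    filetype = args.filetype or env.get("EXAMPLE_FILETYPE", "tiff")
    # argparse only validates values given on the command line,
    # so the environment value is checked explicitly.
    if filetype not in FILETYPES:
        raise ValueError(f"unsupported file type: {filetype!r}")
    return filetype


default_choice = resolve_filetype([], env={})
cli_choice = resolve_filetype(["--filetype", "hdf5"], env={})
env_choice = resolve_filetype([], env={"EXAMPLE_FILETYPE": "hdf5"})
override = resolve_filetype(["--filetype", "tiff"], env={"EXAMPLE_FILETYPE": "hdf5"})
```

With this precedence, TIFF stays the main format, a user can opt into HDF5 for individual commands, and an environment variable remains available for anyone who wants the same format everywhere.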