pathml
pathml copied to clipboard
Parameterize dtype for h5path with `SlideData` constructor
Currently, PathML stores all images with float16
, forcing all image inputs to be upcast or downcast to this data type, which increases storage size or loses information. There already is a dtype
parameter in the SlideData
constructor, but it's only used to assist the BioFormatsBackend
in loading images correctly. This repurposes that parameter to control what dtype h5py uses when writing image data.
I also changed masks to stored as ENUM and use the strongest compression setting as boolean masks are highly compressible and easily compressed. The compression made a huge difference in file size, and using (HDFView)[https://www.hdfgroup.org/downloads/hdfview/] showed a compression ratio of 100-200x for masks. The ENUM data type is stored as an 8-bit integer (https://docs.h5py.org/en/stable/special.html#enumerated-types) but at least this is less than using float16
.
@jacob-rosenthal Do I remember you saying that certain transforms expect float16? If you remember which we should tag and document these
Thanks @tddough98 , this looks great! The reason we had to add the dtype parameter in the first place was as a workaround because some of our collaborators were running into problems with images where the dtype from the metadata didn't match the dtype of the image data itself. So the parameter was added so that people could read images in e.g. float16 even if the image metadata said int8. From what I can tell looking at the code, this PR will still keep that behavior? (Side note, we should add an explicit test for this using one of the misi images to be sure that they are always supported, maybe @MohamedOmar2020 you can help provide a small misi image we can add to the test suite?)
And yes, @ryanccarelli I think we did run into problems with some of the image transforms being picky about dtypes, and found that float16 worked well usually. I think since float16 remains the default, this should still be fine, since things would only break if people manually change dtype themselves
So from what I can tell, this PR seems like strictly an improvement, keeping existing behavior but making it cleaner & more flexible
I reverted storing all masks as boolean. I was focused on masks made by TissueDetectionHE
, which are always boolean, but now realize that instance segmentation masks from mesmer or hovernet need to use different values for each instance. Now, masks are saved with the same datatype as the numpy array for the mask.
Ah yes that is a good catch. I think that should have been caught by the tests? But looks like the tests haven't run for this PR yet, I guess because it is a PR that isn't merging into dev or master (the only branches that trigger tests).