kerchunk
TIFF: internal codec (small chunks) vs. entire file as single chunk (`imagecodecs_tiff` codec)
I just read #78 in detail.
In that issue, @cgohlke and others noticed that many TIFF files contain tiny internal chunks, and mapping these 1:1 to Zarr chunks is not always the desired behavior (see https://github.com/fsspec/kerchunk/issues/78#issuecomment-916375626). Christoph and Martin suggested that, rather than using the internal TIFF chunks, it would be possible to map the entire file to one chunk, using the imagecodecs_tiff codec.
Is this currently supported by tifffile / kerchunk? If so, how does one activate this option? All of my explorations have yielded internal chunks (e.g. imagecodecs_lzw).
There is no specific kerchunk backend to do this, since you don't actually need to "scan" anything. I think it would be worthwhile to make a convenience function to produce the one-chunk dataset per input file for the full range of imagecodecs file formats; these could then be passed to combine as usual.
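As a rough illustration of why no scanning is needed, here is a minimal sketch of what such a convenience function might emit for one file. The function name `single_chunk_refs`, the array name `data`, the URL, and the fill value are all hypothetical; the shape and dtype are assumed to be known already (in practice they would have to be read from the file header). The layout follows the version 1 reference format, where a reference of the form `[url]` points at a whole file.

```python
import json


def single_chunk_refs(url, shape, dtype, codec_id="imagecodecs_tiff", name="data"):
    """Build a reference set that maps one whole file to one Zarr (v2) chunk.

    Hypothetical helper: shape and dtype must be supplied by the caller.
    """
    zarray = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(shape),  # one chunk spans the whole array
        "dtype": dtype,
        "compressor": {"id": codec_id},  # the codec decodes the entire file
        "filters": None,
        "fill_value": 0,  # assumption, adjust for real data
        "order": "C",
    }
    # Key of the only chunk, e.g. "0.0" for a 2D array.
    chunk_key = ".".join("0" for _ in shape) or "0"
    return {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            f"{name}/.zarray": json.dumps(zarray),
            f"{name}/{chunk_key}": [url],  # [url] alone means the whole file
        },
    }


refs = single_chunk_refs("s3://bucket/image.tif", (2048, 2048), "<u2")
print(json.dumps(refs, indent=2))
```

Reading such a dataset would presumably require the imagecodecs numcodecs codecs to be registered first, and the per-file reference sets could then be handed to the combine step like any other kerchunk output.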
Btw, this is quite related to https://github.com/zarr-developers/zarr-specs/issues/220
It would indeed be interesting to produce shard manifests. It would need to be on-demand, though, since we can't store that much information; at that point, it wouldn't do much more than passing the requested array range down to the loader (which I think imagecodecs would handle anyway).
> it wouldn't do much more than passing the requested array range down to the loader
👌 I think this is a key point. We need to figure out how to push the slice operation in zarr down to the codec / compressor.
That is what my "context" thoughts were about :)
We should write out a design doc for this.
@cgohlke, how does the API look on your end? What information do the codecs need to extract only the required bytes out of an image file? Is it even possible?
You mean a ZEP, or is there some other place we can hammer this out?
I see this more as a zarr-python software architecture question, but it could eventually intersect with the spec.
> rather than using the internal tiff chunks, it would be possible to map the entire file to one chunk, using the imagecodecs_tiff codec. Is this currently supported by tifffile / kerchunk?
For multi-file datasets one can use tifffile.FileSequence->ZarrFileSequenceStore->write_fsspec. It works with many image codecs, not only tiff. There's a very complicated example at earthbigdata.py and a simpler test case.
> how does the API look on your end, what information do the codecs need to extract only required bytes out of an image file? Is it even possible?
With imagecodecs, it's only possible (given an index) to extract specific images from multi-image formats like TIFF, JPEGXL, AVIF, and APNG. It is not possible to extract parts of an image/frame.
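To make the consequence for slice push-down concrete, here is a small sketch under stated assumptions. `read_selection` and `decode_frame` are hypothetical names: `decode_frame(index)` stands in for any decoder that can return one whole frame of a multi-image file by index (for instance a thin wrapper around an imagecodecs decoder). The selection along the frame axis can be pushed down to the decoder, but the row and column selection can only be applied after the full frame has been decoded.

```python
def read_selection(decode_frame, n_frames, frame_slice, row_slice, col_slice):
    """Read a 3D selection from a multi-image file.

    decode_frame is a hypothetical callback returning one full frame
    (a list of rows) for a given frame index.
    """
    out = []
    # Only the requested frames are decoded: this part is pushed down.
    for index in range(*frame_slice.indices(n_frames)):
        frame = decode_frame(index)  # the whole frame is always decoded
        # Cropping happens after decoding; no bytes are saved here.
        out.append([row[col_slice] for row in frame[row_slice]])
    return out


# Fake 5-frame "file" with 3x4 frames, plus a log of which frames were decoded.
frames = [[[f * 100 + r * 10 + c for c in range(4)] for r in range(3)] for f in range(5)]
decoded = []


def decode_frame(index):
    decoded.append(index)
    return frames[index]


result = read_selection(decode_frame, len(frames), slice(1, 3), slice(0, 2), slice(1, 3))
print(decoded)  # frames 1 and 2 only
print(result)
```

In this sketch the saving comes entirely from skipping frames, which matches the limitation described above: there is no way to avoid decoding the rest of a frame once it is touched.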