TIFF: internal codec (small chunks) vs. entire file as single chunk (`imagecodecs_tiff` codec)

Open rabernat opened this issue 1 year ago • 10 comments

I just read #78 in detail.

In that issue, @cgohlke and others noticed that many tiff files contain tiny internal chunks, and mapping these 1:1 to Zarr chunks is not always the desired behavior (see https://github.com/fsspec/kerchunk/issues/78#issuecomment-916375626). Christophe and Martin suggested that, rather than using the internal tiff chunks, it would be possible to map the entire file to one chunk, using the imagecodecs_tiff codec.

Is this currently supported by tifffile / kerchunk? If so, how does one activate this option? All of my explorations have yielded internal chunks (e.g. imagecodecs_lzw).

rabernat avatar Apr 11 '23 15:04 rabernat

There is no specific kerchunk backend to do this, since you don't actually need to "scan" anything. I think it would be worthwhile to make a convenience function to produce the one-chunk dataset per input file for the full range of imagecodecs file formats; these could then be passed to combine as usual.

martindurant avatar Apr 11 '23 15:04 martindurant
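
As an illustration of the convenience function suggested above (not code from kerchunk or from this thread), a one-chunk reference set for a single file might look roughly like the sketch below. The function name `single_chunk_refs`, the URL, the shape, and the dtype are all hypothetical. The sketch assumes the kerchunk version 1 reference layout, in which a one-element reference list points at a whole file, and it assumes the array shape and dtype are already known rather than read from the file.

```python
import json


def single_chunk_refs(url, shape, dtype, codec_id="imagecodecs_tiff"):
    """Map one whole image file to a single Zarr v2 chunk (hypothetical helper)."""
    zarray = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(shape),  # one chunk spans the entire array
        "dtype": dtype,
        "compressor": {"id": codec_id},  # the codec decodes the whole file
        "filters": None,
        "fill_value": 0,
        "order": "C",
    }
    # Key of the only chunk, e.g. "0.0" for a 2-D array ("0" for a 0-d array).
    chunk_key = ".".join("0" for _ in shape) or "0"
    return {
        "version": 1,
        "refs": {
            ".zarray": json.dumps(zarray),
            chunk_key: [url],  # no offset/length: reference the whole file
        },
    }


# Placeholder URL, shape, and dtype, for illustration only.
refs = single_chunk_refs("s3://bucket/image.tif", shape=(512, 512), dtype="<u2")
print(json.dumps(refs, indent=2))
```

Nothing here scans the file, which matches the point that no dedicated backend is needed. A real helper would still have to obtain the shape and dtype from somewhere, for example from the file header.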

Btw, this is quite related to https://github.com/zarr-developers/zarr-specs/issues/220

rabernat avatar Apr 11 '23 15:04 rabernat

It would indeed be interesting to produce shard manifests. It would need to be on-demand, though, since we can't store that much information; at that point, it wouldn't do much more than passing the requested array range down to the loader (which I think imagecodecs would handle anyway).

martindurant avatar Apr 11 '23 15:04 martindurant

> it wouldn't do much more than passing the requested array range down to the loader

👌 I think this is a key point. We need to figure out how to push the slice operation in zarr down to the codec / compressor.

rabernat avatar Apr 11 '23 15:04 rabernat

That is what my "context" thoughts were about :)

martindurant avatar Apr 11 '23 15:04 martindurant

We should write out a design doc for this.

rabernat avatar Apr 11 '23 16:04 rabernat

@cgohlke, how does the API look on your end, what information do the codecs need to extract only required bytes out of an image file? Is it even possible?

martindurant avatar Apr 11 '23 16:04 martindurant

You mean a ZEP, or is there some other place we can hammer this out?

martindurant avatar Apr 11 '23 17:04 martindurant

I see this more as a zarr-python software architecture question, but it could eventually intersect with the spec.

rabernat avatar Apr 11 '23 18:04 rabernat

> rather than using the internal tiff chunks, it would be possible to map the entire file to one chunk, using the imagecodecs_tiff codec. Is this currently supported by tifffile / kerchunk?

For multi-file datasets, one can use `tifffile.FileSequence` -> `ZarrFileSequenceStore` -> `write_fsspec`. It works with many image codecs, not only TIFF. There's a very complicated example at earthbigdata.py and a simpler test case.

> how does the API look on your end, what information do the codecs need to extract only required bytes out of an image file? Is it even possible?

With imagecodecs, it's only possible (given an index) to extract specific images from multi-image formats like TIFF, JPEGXL, AVIF, and APNG. It is not possible to extract parts of an image/frame.

cgohlke avatar Apr 11 '23 20:04 cgohlke
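
To make the multi-file case concrete, here is a rough sketch of the kind of reference set in which each file in a sequence becomes exactly one chunk along a new leading axis. It is written by hand for illustration and is not the output of tifffile's `write_fsspec`, which may differ in its details. The function name `file_sequence_refs`, the URLs, the frame shape, and the dtype are hypothetical, and the sketch assumes every file decodes to an array of the same shape and dtype.

```python
import json


def file_sequence_refs(urls, frame_shape, dtype, codec_id="imagecodecs_tiff"):
    """Stack whole files along axis 0, one file per Zarr v2 chunk (hypothetical)."""
    zarray = {
        "zarr_format": 2,
        "shape": [len(urls), *frame_shape],
        "chunks": [1, *frame_shape],  # each chunk holds exactly one file
        "dtype": dtype,
        "compressor": {"id": codec_id},  # the codec decodes a whole file
        "filters": None,
        "fill_value": 0,
        "order": "C",
    }
    refs = {".zarray": json.dumps(zarray)}
    tail = ".0" * len(frame_shape)  # the trailing axes have a single chunk each
    for i, url in enumerate(urls):
        refs[f"{i}{tail}"] = [url]  # one-element list: the whole file
    return {"version": 1, "refs": refs}


# Placeholder file names, for illustration only.
urls = [f"s3://bucket/frame_{i:03d}.tif" for i in range(3)]
seq = file_sequence_refs(urls, frame_shape=(256, 256), dtype="<u2")
print(sorted(k for k in seq["refs"] if not k.startswith(".")))
```

With this layout, a reader can skip files it does not need, but any access inside one file still decodes that whole file, consistent with the limitation described above that parts of an image/frame cannot be extracted.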