open_virtual_dataset fails to open tiffs
open_virtual_dataset says it supports TIFF files, but it currently fails and lacks test coverage. The issue starts here, where the kerchunk reference to a zarr array is treated as an HDF reference containing one or more groups.
Code to reproduce
from pathlib import Path
import numpy as np
from PIL import Image
from virtualizarr import open_virtual_dataset
from virtualizarr.backend import FileType
def write_random_tiff(file_path):
array = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
img = Image.fromarray(array)
img.save(file_path)
file_path = Path.cwd() / "test.tif"
write_random_tiff(file_path)
open_virtual_dataset(str(file_path), filetype=FileType.tiff)
Throws:
File virtualizarr/translators/kerchunk.py:46, in extract_group(vds_refs, group)
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []
virtualizarr = 1.1.0
Ah sorry about this @elyall !
We do have some rudimentary tests for tiff but as the tiff reader is just a thin wrapper around the kerchunk tiff reader I (naively) expected this to "just work". I agree that we should remedy that with better test coverage.
Thanks for the reproducible example, I'll have a look at what might be going on now.
So I've looked into this, and I'm a bit confused by what kerchunk returns, and am ignorant of exactly what information is stored in a tiff file. The kerchunk tiff reader basically returns this information about your example tiff file:
{'refs': {'.zattrs': '{"_ARRAY_DIMENSIONS":["Y","X"]}', '.zarray': '{\n "chunks": [\n 128,\n 128\n ],\n "compressor": null,\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n 128,\n 128\n ],\n "zarr_format": 2\n}', '0.0': ['/private/var/folders/zt/b84f41811p5_vygj1gjp6qx40000gn/T/pytest-of-tom/pytest-38/test_random_tiff0/rand.tiff', 122, 16384]}}
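For anyone following along, that reference dict is just nested JSON: the .zarray and .zattrs values are JSON strings, and each chunk entry is a [path, offset, length] triple. A minimal stdlib sketch of pulling the pieces apart (the dict below is an abbreviated copy of the output above, with a shortened placeholder path):

```python
import json

# Abbreviated form of the kerchunk tiff reader's output shown above:
# top-level keys are zarr store paths, the chunk entry is a
# [file_path, byte_offset, byte_length] triple.
refs = {
    ".zattrs": '{"_ARRAY_DIMENSIONS":["Y","X"]}',
    ".zarray": (
        '{"chunks": [128, 128], "compressor": null, "dtype": "|u1", '
        '"fill_value": 0, "filters": null, "order": "C", '
        '"shape": [128, 128], "zarr_format": 2}'
    ),
    "0.0": ["/tmp/rand.tiff", 122, 16384],
}

# The metadata values are themselves JSON, so parse them a second time.
zarray = json.loads(refs[".zarray"])
zattrs = json.loads(refs[".zattrs"])
path, offset, length = refs["0.0"]
# the single chunk "0.0" covers bytes [offset, offset + length) of the file
```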
VirtualiZarr's tiff reader then attempts to parse this and wrap it up into a nicer xarray.Dataset containing virtualizarr.ManifestArray objects. That's the same way that several of VirtualiZarr's other readers work (and are much better tested).
What confuses me here is how am I supposed to know the name of the array you created? Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention? Examples I find online of using rioxarray and rasterio are for opening GeoTIFF files - do they have more information in them about variable names? This is related to the failure you're seeing, because the virtualizarr parsing logic is expecting kerchunk to return something that has information about group names and array names.
Also is saying that the array dimensions are called "X" and "Y" correct? Or did kerchunk just make that up?
Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention?
They do not. There is just a single array, and sometimes overviews.
I've opened #292 to add your example as a test case and start trying to fix it @elyall.
Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention?
They do not. There is just a single array, and sometimes overviews.
Interesting... So if it's just one unnamed array, then "open_dataset" doesn't even make sense - the tiff reader should really only support open_dataarray, where we return a single unnamed xr.DataArray.
EDIT: My suggestion seems consistent with what open_rasterio does, returning a single DataArray - https://knowledge.dea.ga.gov.au/notebooks/How_to_guides/Opening_GeoTIFFs_NetCDFs/#Opening-a-single-GeoTIFF
Yes a dataarray is a better fit. I just need the ManifestArray object.
Cool. I'll be able to continue this next week, but you're welcome to work on it instead if you're interested.
There can be multiple arrays, including overviews. In fact, before GeoTIFF had explicit overviews, the multiple arrays (subdatasets in GDAL parlance) were used for overviews in some software (maybe only one though ...).
I'm interested to explore this so hoping I can help a little 🙏
@mdsumner I'm not familiar with GeoTIFF but I'm guessing it's similar to other TIFF specs I'm aware of: a multi-page TIFF with specific metadata.
Readers like tifffile can read individual pages of multi-page TIFFs, so each page should probably be treated as an individual chunk wrapped in its own ManifestArray, and the default keys could be the page indices. There are lots of TIFF metadata specs out there and a subset probably have something that translates well into keys, but handling all of those cases would be difficult. It would be easiest to just allow the user to inject keys as an argument during creation.
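To make the multi-page case concrete, here's a small sketch using Pillow (as in the reproduction above) rather than tifffile: it writes a two-page TIFF and reads each page back by its index, which is the only key a plain TIFF container gives us.

```python
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image

# Write a two-page TIFF: each page is an independent 128x128 uint8 image.
path = Path(tempfile.mkdtemp()) / "multipage.tif"
pages = [
    Image.fromarray(np.random.randint(0, 255, (128, 128), dtype=np.uint8))
    for _ in range(2)
]
pages[0].save(path, save_all=True, append_images=pages[1:])

# Read the pages back individually; the zero-based page index is the only
# "key" the plain TIFF container provides.
with Image.open(path) as img:
    shapes = []
    for i in range(img.n_frames):
        img.seek(i)
        shapes.append(np.asarray(img).shape)
```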
Does multi-page mean (multi)subdataset analogously to multi-variable in NetCDF?
I'm not familiar with the tiff spec, mostly only with GeoTIFF usage details and the abstractions
"Pages" could be like bands, or layers, like actual pages in a PDF where the shape doesn't change, nor the pixel type. Sadly, format idioms leak into these abstractions, so please indulge my questions; I expect you'll have the same needs for clarification 🙏
Thanks both of you.
Readers like tifffile can read individual pages of multi-page TIFFs so each page probably should be treated as individual chunks wrapped in their own ManifestArray.
Does multi-page mean (multi)subdataset analogously to multi-variable in NetCDF?
This was going to be my question too. In Xarray terminology, what we need to know is whether each page of a multi-page TIFF maps to
- Different slices (potentially chunked) of a single Variable/DataArray (that's what your example implies @elyall - maybe this is all we need),
- Different (named) variables in the same Dataset (i.e. different arrays that can have different shapes but any common dimensions have consistent shape),
- Something even more messy, which might only be representable as groups of an xarray.DataTree, or a list of unnamed DataArrays.
I'm keen to try and get this working, but it would be much appreciated if someone who actually works with (Geo)TIFF files regularly was interested in helping maintain VirtualiZarr's TIFF reader. The rest of the package is designed to be as modular as possible to minimize the effort required, and we already have some developers who have helpfully contributed and taken ownership over other specific readers (e.g. DMR++). cc'ing @scottyhq and @sharkinsspatial as people who may be interested, but any lurkers feel free to pipe up 🪈
I've only used option 1 where all pages are slices in one array and therefore have matching dimensions, but that is not always the case.
Here's a great overview of internal TIFF structure. The Pyramid TIFF is an example of arrays of different sizes (though same dimensions) that would require separate keys (in this case a DataTree would work, example here).
As we can see, it's complicated: the underlying structure admits a lot of different TIFF specs that would each require their own data loader. All IFD and SubIFD ImageBytes should be individual ManifestArrays. The one consistent thing is that they are sequential, so as a start the keys/schema could be passed in as an argument. From there, individual use cases could be handled as desired.
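The sequential structure is easy to see with nothing but the stdlib: a classic TIFF is a linked chain of IFDs, each holding a 2-byte entry count, 12-byte entries, and a 4-byte offset to the next IFD (0 terminates the chain). A sketch that walks that chain over a Pillow-generated two-page file:

```python
import struct
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image

# Write a two-page TIFF to walk (Pillow writes classic TIFF).
path = Path(tempfile.mkdtemp()) / "pages.tif"
imgs = [Image.fromarray(np.zeros((8, 8), dtype=np.uint8)) for _ in range(2)]
imgs[0].save(path, save_all=True, append_images=imgs[1:])

data = path.read_bytes()
# TIFF header: 2-byte byte order, 2-byte magic (42), 4-byte first IFD offset.
order = "<" if data[:2] == b"II" else ">"
magic, offset = struct.unpack(order + "HI", data[2:8])
assert magic == 42

# Each IFD: 2-byte entry count, count * 12-byte entries, then a 4-byte
# offset to the next IFD (0 terminates the chain).
ifd_offsets = []
while offset != 0:
    ifd_offsets.append(offset)
    (count,) = struct.unpack_from(order + "H", data, offset)
    (offset,) = struct.unpack_from(order + "I", data, offset + 2 + count * 12)
```

Each entry in ifd_offsets corresponds to one page, which is what makes page-index keys the natural default.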
I'm happy to help and potentially maintain a limited scope reader (TIFF spec itself shouldn't change much) but not all of the many edge cases.
I'm still getting the same error as the OP ??
import virtualizarr
virtualizarr.__version__
'1.1.1.dev12+g9e7d430'
from pathlib import Path
import numpy as np
from PIL import Image
from virtualizarr import open_virtual_dataset
from virtualizarr.backend import FileType
def write_random_tiff(file_path):
array = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
img = Image.fromarray(array)
img.save(file_path)
file_path = Path.cwd() / "test.tif"
write_random_tiff(file_path)
open_virtual_dataset(str(file_path), filetype=FileType.tiff)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/backend.py", line 190, in open_virtual_dataset
vds = backend_cls.open_virtual_dataset(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/readers/tiff.py", line 48, in open_virtual_dataset
refs = extract_group(refs, group)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py", line 46, in extract_group
raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []
I'm really keen to explore this, but having issues getting a working environment. Will try again later :)
I got a little ahead of myself as kerchunk's tiff_to_zarr should already solve most of this problem. The heavy lifting is actually occurring in tifffile, specifically in tifffile.ZarrTiffStore. This if statement determines whether the output is a single array (DataArray) or multiple arrays (Dataset/DataTree). If multiple arrays they are sequentially indexed as you can see here. However tiff_to_zarr then plays with some metadata and potentially changes the keys (I'll need to dig into it more).
It would be great if we had an example TIFF file that has multiple arrays that someone understands well to cover the Dataset route. I found this sample GeoTIFF but it appears to be a single array. I could generate a multiscale OME-TIFF file but the keys are sequential and the data fits better in a DataTree than a Dataset.
there's a multi tiff here:
https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
gdalinfo /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Driver: GTiff/GeoTIFF
Files: /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Size is 20, 20
Image Structure Metadata:
INTERLEAVE=BAND
Subdatasets:
SUBDATASET_1_NAME=GTIFF_DIR:1:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
SUBDATASET_1_DESC=Page 1 (20P x 20L x 1B)
SUBDATASET_2_NAME=GTIFF_DIR:2:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
SUBDATASET_2_DESC=Page 2 (20P x 20L x 1B)
Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 20.0)
Upper Right ( 20.0, 0.0)
Lower Right ( 20.0, 20.0)
Center ( 10.0, 10.0)
Band 1 Block=20x20 Type=Byte, ColorInterp=Gray
Oh tifffile allows you to open tiffs as Zarr arrays?! Let's just use that! Then we can skip the kerchunk step entirely. Our tiff reader could re-use code from the (WIP) Zarr reader in #271.
Okay, I tried using an approach like
import zarr
from tifffile import imread

store = imread(filepath, aszarr=True)
# TODO exception handling for TIFF files with multiple arrays
za = zarr.open_array(store=store, mode="r")
but ended up down a rabbit hole because a) tifffile doesn't appear to support zarr-python v3 yet and b) determining the actual byte ranges is still tricky - in kerchunk.tiff.tiff_to_zarr I think it happens in the store.write_fsspec step.
EDIT: I posted where I got to in #295 in case it's ever of any use.
I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.
a) tifffile doesn't appear to support zarr-python v3 yet
It does not at the moment.
and b) determining the actual byte ranges is still tricky - in kerchunk.tiff.tiff_to_zarr I think it happens in the store.write_fsspec step.
That's correct, and write_fsspec is a method of tifffile.ZarrTiffStore. It's designed to write to a file, but kerchunk.tiff.tiff_to_zarr writes it to an in-memory text buffer and then converts it back to json.
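That buffer round-trip is easy to sketch without tifffile installed; write_fsspec_stub below is a hypothetical stand-in for tifffile.ZarrTiffStore.write_fsspec, just to show the StringIO-then-json pattern kerchunk uses:

```python
import io
import json

def write_fsspec_stub(buf):
    # Hypothetical stand-in for tifffile.ZarrTiffStore.write_fsspec,
    # which serializes kerchunk-style references to a file-like object.
    # The refs content here is made up for illustration.
    buf.write(json.dumps({
        "version": 1,
        "refs": {".zarray": "{}", "0.0": ["file.tif", 122, 16384]},
    }))

# kerchunk's trick: hand the writer an in-memory text buffer instead of a
# real file, then decode the buffer's contents back into a Python dict.
buf = io.StringIO()
write_fsspec_stub(buf)
refs = json.loads(buf.getvalue())
```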
there's a multi tiff here:
@mdsumner That file has two arrays with the same shape and dimensions. tifffile.ZarrTiffStore.write_fsspec, for better or worse, treats them as a single array with two chunks, one for each image, which fits better in an xarray.DataArray. The example we need is one where a TIFF has multiple arrays of different shapes, dimensions, or coords (and ideally keys for each array that tifffile can parse), which would need to be wrapped in an xarray.Dataset or xarray.DataTree. Or we need an example where tifffile.ZarrTiffStore.write_fsspec understands a TIFF spec and places references to matching arrays (same shape and dimensions) under different zarr groups.
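To make the distinction concrete, here's a hand-written sketch of the two possible shapes of reference dict for a two-image file (the file name echoes the GDAL sample above, but the byte offsets/lengths are made up and the page0/page1 keys are hypothetical):

```python
import json

def zarray(shape, chunks):
    # Minimal zarr v2 array metadata for a raw uint8 array.
    return json.dumps({"shape": shape, "chunks": chunks, "dtype": "|u1",
                       "compressor": None, "fill_value": 0, "filters": None,
                       "order": "C", "zarr_format": 2})

# Reading 1 (what write_fsspec produces here): one array with a new
# leading "page" dimension; the two images become chunks 0.0.0 and 1.0.0.
as_one_array = {
    ".zarray": zarray([2, 20, 20], [1, 20, 20]),
    "0.0.0": ["twoimages.tif", 8, 400],
    "1.0.0": ["twoimages.tif", 408, 400],
}

# Reading 2 (what a Dataset would need): two named arrays under separate
# zarr group keys, each carrying its own metadata.
as_two_arrays = {
    "page0/.zarray": zarray([20, 20], [20, 20]),
    "page0/0.0": ["twoimages.tif", 8, 400],
    "page1/.zarray": zarray([20, 20], [20, 20]),
    "page1/0.0": ["twoimages.tif", 408, 400],
}
```

The byte ranges are identical in both readings; only the key layout changes, which is exactly what decides whether the result maps to a DataArray or a Dataset.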
I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.
Sounds good. Right now I can help most with developing tests but I'll take a look at some of the other VirtualiZarr readers and try to get a grasp of what they ultimately produce.
I think that's fine: multiple chunks is how they're intended to be interpreted, not as different arrays. (It's very old style.)
Thanks both. I put my attempt to bypass kerchunk up in #295. I've also submitted #296 to change the documentation to no longer falsely advertise that we can read tiff files when we cannot yet.
I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.
Sounds good. Right now I can help most with developing tests but I'll take a look at some of the other VirtualiZarr readers and try to get a grasp of what they ultimately produce.
It really shouldn't be particularly difficult to finish #292 - it mostly requires rejigging the virtualizarr.translators.kerchunk module to handle parsing the dict that kerchunk.tiff_to_zarr returns.
it mostly requires rejigging the virtualizarr.translators.kerchunk module to handle parsing the dict that kerchunk.tiff_to_zarr returns
I've done that here. The only real change was to have virtual_vars_from_kerchunk_refs accept arr_refs as input instead of refs. Now a single array, as produced by kerchunk.tiff_to_zarr, can be passed in.
The question is, do you want to create a dataarray_from_kerchunk_refs equivalent to dataset_from_kerchunk_refs or do you want to add an optional key argument to dataset_from_kerchunk_refs that assigns the array to a key and then wraps it in a Dataset? Or another alternative I haven't thought of.
#297 looks good, thank you @elyall !
The question is, do you want to create a dataarray_from_kerchunk_refs equivalent to dataset_from_kerchunk_refs or do you want to add an optional key argument to dataset_from_kerchunk_refs that assigns the array to a key and then wraps it in a Dataset?
Yeah, good question. The main advantage of coercing a single-image TIFF into a Dataset would be that the user can always know open_virtual_dataset will succeed on a .tiff file first time. Having an optional key argument wouldn't help with that, because then you'd have to call open_virtual_dataset('image.tiff'), get an error, then call open_virtual_dataset('image.tiff', key='dummy_var_name'), at which point you may as well just do open_virtual_dataarray('image.tiff').
Or another alternative I haven't thought of.
We could imagine providing a new overloaded opening function like
def open_tiff(filepath: str) -> Dataset | DataArray | DataTree:
...
but I don't really like that idea either, because it's likely the next line of code just breaks depending on which type was returned, and it breaks the correspondence with xarray.open_dataset/open_datatree etc.
(Note that we have basically the same problem with not knowing if GRIB/netCDF files map to Dataset or DataTree without inspecting them once first to find out if they contain sub-groups.)
So I think I prefer that we add open_virtual_dataarray (like I started doing in #295), and depending on the type of TIFF file one of open_virtual_dataarray/dataset/datatree should succeed.
@maxrjones is this superseded by https://github.com/virtual-zarr/virtual-tiff?