
Tiff test error

RichardScottOZ opened this issue 8 months ago • 25 comments

/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py:47: UserWarning: storage_options have been dropped from reader_options as they are not supported by kerchunk.tiff.tiff_to_zarr
  warnings.warn(
Traceback (most recent call last):
  File "/home/ubuntu/data/model-framework/test_vz.py", line 20, in <module>
    vds = open_virtual_dataset("s3://bananasplits/rasters/fleagle.tif", reader_options={'storage_options': aws_credentials})
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/backend.py", line 200, in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py", line 55, in open_virtual_dataset
    refs = extract_group(refs, group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/translators/kerchunk.py", line 55, in extract_group
    raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

Installed VirtualiZarr half an hour ago, Python 3.11.

RichardScottOZ avatar Apr 02 '25 01:04 RichardScottOZ

I'm working on a rewrite of our TIFF virtualization over in https://github.com/zarr-developers/VirtualiZarr/pull/524, since this has been broken for a long while. Are you able to share what file you're trying to virtualize so that I can test it works?

maxrjones avatar Apr 02 '25 14:04 maxrjones

@RichardScottOZ the TIFF reader currently in the library is broken and not intended to be used (#291). Sorry about that - I thought we had a NotImplementedError but apparently not.

As @maxrjones says we hopefully will have a shiny new TIFF reader available soon!

FYI, even for a reader that is working, we would need a much more reproducible example to help you.

TomNicholas avatar Apr 02 '25 14:04 TomNicholas

To expand a bit on the status update after reading the context in https://github.com/OSGeo/gdal/issues/11824, #524 is not yet ready for others to try. A list of GeoTIFFs/COGs that you'd like to make sure are supported would really speed up development, since it'll expand the list of compression schemes, etc. that are implemented. I will aim to get an MVP finished this week for experimentation.

I'm confident that using async_tiff, as done in that PR, is the best path forward, in contrast to the other open/closed PRs you've seen, because it makes it simple to work with Zarr V3 internally. Meanwhile, the maintainer of tifffile (which Kerchunk uses) has indicated that they do not plan to support Zarr V3-style references.

maxrjones avatar Apr 02 '25 14:04 maxrjones

I'm confident that using async_tiff, as done in that PR, is the best path forward, in contrast to the other open/closed PRs you've seen, because it makes it simple to work with Zarr V3 internally. Meanwhile, the maintainer of tifffile (which Kerchunk uses) has indicated that they do not plan to support Zarr V3-style references.

@maxrjones do you foresee any scenario that your async_tiff approach couldn't handle that the tifffile approach could? Should we just close the tifffile-related issues now?

TomNicholas avatar Apr 02 '25 14:04 TomNicholas

I can make a more complete thing later today with a bit of luck

RichardScottOZ avatar Apr 02 '25 19:04 RichardScottOZ

I'm working on a rewrite of our TIFF virtualization over in #524, since this has been broken for a long while. Are you able to share what file you're trying to virtualize so that I can test it works?

I can do something very similar at least, Max.

RichardScottOZ avatar Apr 02 '25 19:04 RichardScottOZ

import os
import configparser
import contextlib

from virtualizarr import open_virtual_dataset

def get_aws_credentials():
    # Read the [default] profile from ~/.aws/credentials into a dict
    parser = configparser.RawConfigParser()
    parser.read(os.path.expanduser('~/.aws/credentials'))
    credentials = parser.items('default')
    all_credentials = {key.upper(): value for key, value in credentials}
    with contextlib.suppress(KeyError):
        all_credentials["AWS_REGION"] = all_credentials.pop("REGION")
    return all_credentials

creds = get_aws_credentials()

# fsspec/s3fs-style storage options
aws_credentials = {"key": creds['AWS_ACCESS_KEY_ID'], "secret": creds['AWS_SECRET_ACCESS_KEY']}
vds = open_virtual_dataset("s3://banana/test_reference.tif", reader_options={'storage_options': aws_credentials})

print(vds.mean())

# error output:

python test_vz.py 
/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py:47: UserWarning: storage_options have been dropped from reader_options as they are not supported by kerchunk.tiff.tiff_to_zarr
  warnings.warn(
Traceback (most recent call last):
  File "/home/ubuntu/data/model-framework/test_vz.py", line 20, in <module>
    vds = open_virtual_dataset("s3://banana/test_reference.tif", reader_options={'storage_options': aws_credentials})
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/backend.py", line 200, in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/readers/tiff.py", line 55, in open_virtual_dataset
    refs = extract_group(refs, group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/pangeo/lib/python3.11/site-packages/virtualizarr/translators/kerchunk.py", line 55, in extract_group
    raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

RichardScottOZ avatar Apr 02 '25 19:04 RichardScottOZ

@maxrjones test file above here https://gitlab.com/Richard.Scott1/raster-analysis-goals/-/blob/main/test_reference.tif?ref_type=heads

RichardScottOZ avatar Apr 02 '25 20:04 RichardScottOZ

thank you @RichardScottOZ! what timezone are you in? I know there's a lot of shared interest right now on Virtual approaches with TIFFs including you, me (e.g., https://github.com/maxrjones/why-virtualize-geotiff), @mdsumner, and @norlandrhagen. It could be fun to have a bug smashing session on the TIFF reader next week if you're interested

maxrjones avatar Apr 02 '25 20:04 maxrjones

Would love to join!

norlandrhagen avatar Apr 02 '25 20:04 norlandrhagen

Here are a few COGs that I would expect to work. I thought I was just doing something wrong (but also didn't pursue it very deeply):

https://projects.pawsey.org.au/idea-gebco-tif/GEBCO_2024.tif (4 GB)

https://github.com/mdsumner/rema-ovr/raw/refs/heads/main/rema_mosaic_1km_v2.0_filled_cop30_dem.tif (60 MB)

https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/55/G/EN/2025/3/S2C_T55GEN_20250324T000834_L2A/TCI.tif (250 MB)

https://data.source.coop/ausantarctic/ghrsst-mur-v2/2025/03/31/20250331090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1_analysed_sst.tif (300 MB)

https://github.com/mdsumner/ibcso-cog/raw/main/IBCSO_v2_ice-surface_cog.tif (220 MB)

I would also add the autotest suite from GDAL to get really comprehensive coverage of file vagaries.

I've listed all current "\.tif$" files here (prepend 'https://raw.githubusercontent.com/OSGeo/gdal/refs/heads/master/autotest' to get the full path):

https://gist.githubusercontent.com/mdsumner/29b22ece80c829ae4aefbecbf4eef531/raw/a8ea6f13f8c96b8434b7fc3ae0629814914135c0/autotest_tif.txt
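
For example, a quick sketch of expanding that list into full URLs (assuming the gist's entries are repo-relative paths; adjust if they carry a leading slash):

import urllib.request

GIST = ("https://gist.githubusercontent.com/mdsumner/29b22ece80c829ae4aefbecbf4eef531/"
        "raw/a8ea6f13f8c96b8434b7fc3ae0629814914135c0/autotest_tif.txt")
PREFIX = "https://raw.githubusercontent.com/OSGeo/gdal/refs/heads/master/autotest"

with urllib.request.urlopen(GIST) as resp:
    paths = resp.read().decode().splitlines()

# Prepend the autotest prefix; tolerate entries with or without a leading slash
urls = [f"{PREFIX}/{p.strip().lstrip('/')}" for p in paths if p.strip()]
print(len(urls), urls[:3])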

mdsumner avatar Apr 02 '25 20:04 mdsumner

I'll dig up other large files that aren't COGs, but famous ones are GEBCO and IBCSO:

https://www.gebco.net/data_and_products/gridded_bathymetry_data/ (the TIFF link is a directory containing the file in a zip, so it's weird to have that as a canonical URL, and while you can stream it, it doesn't work in some places)

https://doi.pangaea.de/10.1594/PANGAEA.937574?format=html#download (https://download.pangaea.de/dataset/937574/files/IBCSO_v2_ice-surface_WGS84.tif)

mdsumner avatar Apr 02 '25 21:04 mdsumner

thank you @RichardScottOZ! what timezone are you in? I know there's a lot of shared interest right now on Virtual approaches with TIFFs including you, me (e.g., https://github.com/maxrjones/why-virtualize-geotiff), @mdsumner, and @norlandrhagen. It could be fun to have a bug smashing session on the TIFF reader next week if you're interested

Next week, Australian Central Time (GMT+9:30), half an hour behind Michael basically.

RichardScottOZ avatar Apr 02 '25 22:04 RichardScottOZ

thank you @RichardScottOZ! what timezone are you in? I know there's a lot of shared interest right now on Virtual approaches with TIFFs including you, me (e.g., maxrjones/why-virtualize-geotiff), @mdsumner, and @norlandrhagen. It could be fun to have a bug smashing session on the TIFF reader next week if you're interested

Next week, Australian Central Time (GMT+9:30), half an hour behind Michael basically.

We don't have much overlap in waking hours, but I sent an invite to y'all for Monday evening GMT-4 / Tuesday morning GMT+9:30. No worries if it doesn't work out; the examples shared here will give me plenty to work from.

maxrjones avatar Apr 03 '25 00:04 maxrjones

That should be ok I think Max.

RichardScottOZ avatar Apr 03 '25 10:04 RichardScottOZ

To jump in on this - here's a set of GeoTIFFs I'm working on virtualizing: SENTINEL1 Sigma Nought (SIG0) Backscatter at 20 meter resolution.

I've been following a manual approach with tifffile following this tutorial and the lessons from this issue https://github.com/fsspec/kerchunk/issues/78, but the lack of access to intra-file chunks makes it a non-starter for our use case, unfortunately. We're fitting models to multi-year timeseries stacks per pixel, which requires loading the entire stack of images per 15000x15000 tile into memory when you can't access sub-file chunks. Too much for our beefy machines. I can build the virtual zarr just fine and use it to load data without a problem, but this specific processing case is important and just doesn't work well right now.

Note the nonstandard blocking of 5x15000. I've heard tell from coworkers that there may be some images in there with different blocking, but I haven't run any checks to confirm that yet. I do wonder whether this would cause issues with the proposed TIFF reader, if true.

I admire all the hard work being done here to enable big geodata workflows, huge thanks to everyone involved :)

claytharrison avatar Apr 04 '25 07:04 claytharrison

We're fitting models to multi-year timeseries stacks per pixel, which requires loading the entire stack of images per 15000x15000 tile into memory when you can't access sub-file chunks. Too much for our beefy machines. I can build the virtual zarr just fine and use it to load data without a problem, but this specific processing case is important and just doesn't work well right now.

This sounds like a scenario where you're better off rechunking your data to align more with your access pattern (i.e. contiguous in time; chunked in space) and then storing it as native (not virtual Zarr).
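
For instance, a minimal sketch of that pattern with xarray, assuming kerchunk's xarray engine (the reference file, dimension names, and chunk sizes here are hypothetical):

import xarray as xr

# Open the combined references lazily ("combined.json" is a hypothetical
# kerchunk reference file for the stack).
ds = xr.open_dataset("combined.json", engine="kerchunk")

# Rechunk so each chunk is contiguous in time and tiled in space,
# matching the per-pixel timeseries access pattern.
rechunked = ds.chunk({"time": -1, "y": 512, "x": 512})

# Write a native (non-virtual) Zarr copy optimized for those queries.
rechunked.to_zarr("s3://bucket/native-rechunked.zarr", mode="w")

At PB scale you'd likely drive this with rechunker or a distributed framework rather than a single to_zarr call, but the shape of the operation is the same.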

rabernat avatar Apr 04 '25 13:04 rabernat

This sounds like a scenario where you're better off rechunking your data to align more with your access pattern (i.e. contiguous in time; chunked in space) and then storing it as native (not virtual Zarr).

I absolutely agree, and this is where we're slowly headed, but this is a PB-scale dataset that plenty of operational workflows depend on, so it's going to take a while before that's done unfortunately.

Virtualizing it would be a really nice in-between step, and it feels so close if only we could represent those tiff chunks.

claytharrison avatar Apr 04 '25 14:04 claytharrison

@claytharrison that makes sense, and is a pattern I expect to see a lot in the future. First virtualize the data with the original chunks for cloud-optimized but still-suboptimal performance, then later create additional rechunked copies optimized for expected query patterns.

TomNicholas avatar Apr 04 '25 14:04 TomNicholas

I've been following a manual approach with tifffile following this tutorial and the lessons from this issue https://github.com/fsspec/kerchunk/issues/78, but the lack of access to intra-file chunks makes it a non-starter for our use case, unfortunately.

Tifffile provides access to intra-file chunks. Did you try to create reference files for the individual files (for example via tiff2fsspec) and merge them (assuming image sizes, chunk sizes, data types, compressions, etc. match)?
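
Roughly along these lines, assuming tifffile's tiff2fsspec helper and kerchunk's combine machinery (file names and URLs are hypothetical; check each library's docs for exact signatures):

from tifffile import tiff2fsspec
from kerchunk.combine import MultiZarrToZarr

# Write a kerchunk/fsspec reference JSON next to each TIFF
files = ["sig0_2020.tif", "sig0_2021.tif"]
for name in files:
    tiff2fsspec(name, f"https://example.com/{name}", out=name + ".json")

# Merge the per-file references along a new time dimension
mzz = MultiZarrToZarr(
    [name + ".json" for name in files],
    concat_dims=["time"],
    coo_map={"time": "INDEX"},  # use file order as the time coordinate
)
combined = mzz.translate()  # a dict of merged references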

you foresee any scenario that your async_tiff approach couldn't handle that the tifffile approach could?

Does the async_tiff approach handle TIFF-like formats (ImageJ, LSM, NDPI, etc), volumetric tiles, sparse segments, higher dimensional datasets, JPEG tables? Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

cgohlke avatar Apr 04 '25 16:04 cgohlke

FYI we can (and it sounds like we should) have multiple TIFF readers for virtualizarr. After #498, implementing one of these readers will be as simple as writing a single function like reader_func(path: str) -> ManifestStore, so we can have as many as needed to serve all possible requirements.
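
For illustration, such a reader might look like this sketch (the import path and body are placeholders, not a final API):

from virtualizarr.manifests import ManifestStore  # assumed import path

def my_tiff_reader(path: str) -> ManifestStore:
    """Hypothetical custom reader plugging into virtualizarr after #498."""
    # 1. Parse the TIFF IFDs to get byte offsets/lengths for each tile or strip
    # 2. Translate the tags into Zarr-compatible array metadata (dtype, codecs, chunk grid)
    # 3. Return a ManifestStore wrapping those chunk references
    raise NotImplementedError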

TomNicholas avatar Apr 04 '25 17:04 TomNicholas

Thanks for joining the discussion @cgohlke and for all your work on TIFFFile and imagecodecs!

Does the async_tiff approach handle TIFF-like formats (ImageJ, LSM, NDPI, etc), volumetric tiles, sparse segments, higher dimensional datasets, JPEG tables?

Probably not given it's relatively new and motivated primarily by geospatial use-cases, though @kylebarron would know more as the author of async_tiff.

Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

This is really helpful context, thanks for sharing. I don't expect any of this to be easier with async_tiff, since we still need to translate the metadata to be Zarr-compatible and will run up against the same absence of float24 or complex integer dtypes, for example. I also expect that we'll want to register imagecodecs codecs in the Array metadata (if that's possible), meaning all codec limitations will be the same.
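
As a sketch of the registration idea, assuming imagecodecs' numcodecs shim (which covers Zarr V2-style codecs today; whether Zarr V3 metadata can reference them is the open question):

from imagecodecs import numcodecs as imagecodecs_numcodecs

# Register imagecodecs' numcodecs-compatible codecs (LZW, JPEG, WebP, ...)
# so they can be resolved by name from array metadata.
imagecodecs_numcodecs.register_codecs()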

maxrjones avatar Apr 07 '25 02:04 maxrjones

FYI we can (and it sounds like we should) have multiple TIFF readers for virtualizarr. After https://github.com/zarr-developers/VirtualiZarr/issues/498, implementing one of these readers will be as simple as writing a single function like reader_func(path: str) -> ManifestStore, so we can have as many as needed to serve all possible requirements.

+1. I'm currently debating whether it would even be better to develop the async_tiff reader separately from the VirtualiZarr repo rather than continuing to build it as a refactor of TIFFVirtualBackend as in https://github.com/zarr-developers/VirtualiZarr/pull/524. Here are my reasons for considering building it out separately (currently experimenting in https://github.com/maxrjones/virtual-tiff):

  • If there is a single canonical TIFF reader, it might make more sense to base it off TIFFFile as a more established library than async_tiff
  • I'd like to be really comprehensive in testing, which might be discouraged given that VirtualiZarr's primary objective is not readers. As one example, I'm currently downloading and using all of GDAL's test TIFF files (thanks to Michael for the pointer to these)
  • Developing the async_tiff reader separately would speed up tests all around
  • VirtualiZarr is approaching a stable API while also getting a lot more engagement, so it might be time in general to consider splitting off readers to achieve a more sustainable maintenance model

maxrjones avatar Apr 07 '25 02:04 maxrjones

I agree that there's scope for multiple TIFF readers. tifffile has been around longer, is in much wider use, and seems to be very stable, so it's probably the better default to use. But judging from other obstore performance improvements (https://github.com/zarr-developers/zarr-python/pull/1661#issuecomment-2780104437) I think there's still potential in an opt-in async-tiff backend (they each use the same Rust IO code under the hood).

Does the async_tiff approach handle TIFF-like formats (ImageJ, LSM, NDPI, etc), volumetric tiles, sparse segments, higher dimensional datasets, JPEG tables?

The current API of async-tiff is quite minimal and mostly consists of parsing and then exposing raw IFD metadata to the user.

I would say that TIFF-like formats that are not actually TIFF are not in scope. It is able to parse the JPEG tables from the IFD metadata though.

Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

I think this question is less about parsing TIFF and more about how to represent the parsed metadata in virtual references? In that case those questions go back to @maxrjones and @TomNicholas, as I haven't been following that side of the work stream. I've only been working on the low-level format parsing.

kylebarron avatar Apr 07 '25 14:04 kylebarron

I'm currently debating whether it would even be better to develop the async_tiff reader separately from the VirtualiZarr repo

I would be fine with that. But we should have some default TIFF reader somewhere because it's such a widely-used format.

Some features found in TIFF that I have been struggling to represent in virtual references are bitorder reversal, packed integers, float24 and complex integer data types, color profiles, variable JPEG tables across TIFF pages, and incomplete chunks (cropped tiles and last strips).

I think this question is less about parsing TIFF and more about how to represent the parsed metadata in virtual references?

Yes, but mostly it's actually a zarr issue rather than a virtualizarr issue (requiring new codecs or data types, for example). It's easy to represent any codec or data type in virtualizarr, iff it is actually supported by zarr.

TomNicholas avatar Apr 07 '25 15:04 TomNicholas