Support passing configuration options to default_object_store

Open TomNicholas opened this issue 8 months ago • 6 comments

Encountered while working on #557

vz.open_virtual_dataset(
    's3://cworthy/oae-efficiency-atlas/data/experiments/000/01/alk-forcing.000-1999-01.pop.h.0347-01.nc',
    loadable_variables=[],
    decode_times=False,
    reader_options={'storage_options': {'anon': True, 'endpoint_url': 'https://data.source.coop/'}},
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File <timed exec>:1

File ~/Documents/Work/Code/VirtualiZarr/virtualizarr/backend.py:351, in open_virtual_mfdataset(paths, concat_dim, compat, preprocess, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    347 executor = get_executor(parallel=parallel)
    348 with executor() as exec:
    349     # wait for all the workers to finish, and send their resulting virtual datasets back to the client for concatenation there
    350     virtual_datasets = list(
--> 351         exec.map(
    352             open_func,
    353             paths1d,
    354         )
    355     )
    357 # TODO add file closers
    358 
    359 # Combine all datasets, closing them in case of a ValueError
    360 try:

File ~/Documents/Work/Code/VirtualiZarr/virtualizarr/parallel.py:310, in LithopsEagerFunctionExecutor.map(self, fn, timeout, chunksize, *iterables)
    307 fexec = lithops.FunctionExecutor()
    309 futures = fexec.map(fn, *iterables)
--> 310 results = fexec.get_result(futures)
    312 return results
...
File /usr/local/lib/python3.12/site-packages/virtualizarr/manifests/store.py:134, in _find_bucket_region()

File /usr/local/lib/python3.12/site-packages/requests/structures.py:52, in __getitem__()

KeyError: 'x-amz-bucket-region'

TomNicholas avatar Apr 22 '25 20:04 TomNicholas

From @sharkinsspatial :

This is a case where, as @Max outlined in https://github.com/zarr-developers/VirtualiZarr/issues/553, we are caught in a transition period where the open_virtual_dataset signature no longer aligns with the internals. In this case you would want to pass an S3Store, pre-constructed using the aws_endpoint argument pointing to the https://data.source.coop/ proxy endpoint, as a store arg for open_virtual_dataset. So we will need some way for users to inject their own obstore store for whatever function signature we decide on.

TomNicholas avatar Apr 22 '25 20:04 TomNicholas

In addition to what Sean said, https://github.com/zarr-developers/VirtualiZarr/pull/558 at least gives you an informative error message for what went wrong.
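To illustrate what an informative error could look like here, below is a minimal sketch. It is not the code from #558: `bucket_region_from_headers` is a hypothetical helper, and the wording of the message is made up for illustration. The only detail taken from the traceback above is the `x-amz-bucket-region` header name.

```python
from typing import Mapping


def bucket_region_from_headers(headers: Mapping[str, str], bucket: str) -> str:
    """Hypothetical helper: return the bucket region or fail with a clear message.

    `headers` is any mapping of response headers. A plain dict lookup is
    case-sensitive, unlike the header mapping used by `requests`.
    """
    region = headers.get("x-amz-bucket-region")
    if region is None:
        # Explain the likely cause instead of surfacing a bare KeyError.
        raise ValueError(
            f"Could not determine the region of bucket {bucket!r}: the response "
            "has no 'x-amz-bucket-region' header. This can happen when the URL "
            "points at a proxy or custom endpoint rather than at AWS S3 itself."
        )
    return region
```

With a header mapping that lacks the key, this raises a `ValueError` naming the bucket rather than `KeyError: 'x-amz-bucket-region'`.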

We also probably want to encourage backends to flag if they receive options that they do not use. In this case the newer HDF5 reader does not use storage_options.
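As a rough sketch of what flagging unused options could look like (this is not an existing VirtualiZarr API; `warn_on_unused_options`, its parameters, and the backend name passed to it are all hypothetical):

```python
import warnings


def warn_on_unused_options(
    received: dict, supported: set[str], backend_name: str
) -> list[str]:
    """Hypothetical helper: warn about option keys a backend does not use.

    Returns the sorted list of unused keys so callers can also inspect it.
    """
    unused = sorted(set(received) - supported)
    if unused:
        warnings.warn(
            f"{backend_name} does not use the following options: {unused}",
            UserWarning,
            stacklevel=2,
        )
    return unused
```

A backend that ignores storage_options could then call `warn_on_unused_options(reader_options, supported_keys, "ExampleBackend")` on entry, so the user in this issue would have seen a warning instead of silently dropped configuration.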

maxrjones avatar Apr 22 '25 20:04 maxrjones

I think, from the recent conversation in Slack, people would rather have a way to pass configuration options into default_object_store() than explicitly provide a store. @TomNicholas are you working on this? Otherwise I could open a PR, but happy to leave it to you if you're on it.

maxrjones avatar Apr 22 '25 21:04 maxrjones

I was about to have a go, but haven't started yet!

TomNicholas avatar Apr 22 '25 21:04 TomNicholas

I was about to have a go, but haven't started yet!

Nice, I'll provide a review then. Thanks for working on it

maxrjones avatar Apr 22 '25 21:04 maxrjones

It turns out that this works totally fine without any changes to the code:

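import virtualizarr as vz
# import path assumed for the VirtualiZarr version discussed in this thread
from virtualizarr.readers import HDFVirtualBackend

vds = vz.open_virtual_dataset(
    "https://data.source.coop/cworthy/oae-efficiency-atlas/data/polygon_masks.nc",
    backend=HDFVirtualBackend,
)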

This works because, under the hood, passing the URL in this form creates an obstore HTTPStore instead of an S3Store, like this:

store = obs.store.HTTPStore.from_url("https://data.source.coop/")

then

reader = ObstoreReader(store=store, path="https://data.source.coop/cworthy/oae-efficiency-atlas/data/polygon_masks.nc")

which works fine with no further configuration options passed.
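For documentation purposes, here is a small stdlib-only sketch of the idea. It is not VirtualiZarr's actual dispatch code: `store_kind_for_url` and `s3_url_via_proxy` are hypothetical helpers, and the rewrite assumes the proxy exposes the bucket name as the first path component, which is what the two URLs in this thread suggest.

```python
from urllib.parse import urlparse


def store_kind_for_url(url: str) -> str:
    """Hypothetical helper: name the obstore store class a URL scheme maps to."""
    scheme = urlparse(url).scheme
    if scheme == "s3":
        return "S3Store"
    if scheme in ("http", "https"):
        return "HTTPStore"
    raise ValueError(f"Unsupported URL scheme: {scheme!r}")


def s3_url_via_proxy(s3_url: str, endpoint: str) -> str:
    """Hypothetical helper: rewrite s3://bucket/key as <endpoint>/bucket/key.

    Assumes the proxy serves objects path-style (bucket first, then key).
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// URL, got {s3_url!r}")
    # netloc is the bucket name; path is the object key with a leading slash.
    return f"{endpoint.rstrip('/')}/{parsed.netloc}{parsed.path}"


url = s3_url_via_proxy(
    "s3://cworthy/oae-efficiency-atlas/data/polygon_masks.nc",
    "https://data.source.coop/",
)
print(url)                      # the HTTPS form used in the working example
print(store_kind_for_url(url))  # HTTPStore
```

The point is only that the https:// form sidesteps the S3 region lookup entirely, because no S3Store is ever constructed.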

I'm going to close #560 for now as not necessary, but I'll leave this issue open, as this use case should at least be documented.

TomNicholas avatar Apr 23 '25 14:04 TomNicholas

Completed by #601

maxrjones avatar Jun 16 '25 20:06 maxrjones