
open_virtual_dataset doesn't resolve Azure storage path

Open ecamossi opened this issue 5 months ago • 5 comments

Hello everyone,

I'm trying to use virtualizarr version 2.0.1 to virtualize access to some netCDF data files stored in an Azure storage container, but the creation of virtual datasets fails when resolving the remote URLs.

The code below is a minimal example that creates a virtual dataset for a single netCDF file and raises the error. The remote file is accessible from the Azure storage container, and the remote URL to the file is resolved correctly when registry.resolve is run outside of open_virtual_dataset. When open_virtual_dataset is executed to virtualise the same file, the URL is instead mapped to the local storage of the compute instance where the code runs, which does not exist, so the resolve step fails.

Code snippet with results (details removed):

import os
import sys
import fsspec
import glob
import adlfs

import obstore as obs

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers.hdf import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

bucket = "abfs://" + os.environ["AZURE_STORAGE_CONTAINER"]  # env variable for the Azure storage container
store = obs.store.from_url(bucket, account_name=os.environ["AZURE_STORAGE_ACCOUNT"], skip_signature=True)  # env variable for the Azure storage account

parser = HDFParser()
registry = ObjectStoreRegistry({f"{bucket}": store})

f_url = 'abfs://<my_azure_storage_container>/<remote_path_to_netcdf_file>'
registry.resolve(url=f_url)

The remote URL to the file is correctly resolved by the call above:

AzureStore(container_name="<my_azure_storage_container>", account_name="<my_azure_storage_account>"),
 '<remote_path_to_netcdf_file>'

but not inside open_virtual_dataset

vds = open_virtual_dataset(
  url=f_url,
  parser=parser,
  registry=registry,
  loadable_variables=[],
)

which raises this error

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[...], line 1
----> 1 vds = open_virtual_dataset(
      2   url=_url,
      3   parser=parser,
      4   registry=registry,
      5   loadable_variables=[],
      6 )

File [...]/lib/python3.12/site-packages/virtualizarr/xarray.py:87, in open_virtual_dataset(url, registry, parser, drop_variables, loadable_variables, decode_times)
     45 """
     46 Open an archival data source as an [xarray.Dataset][] wrapping virtualized zarr arrays.
     47 
   (...)     83     in `loadable_variables` and normal lazily indexed arrays for each variable in `loadable_variables`.
     84 """
     85 filepath = validate_and_normalize_path_to_uri(url, fs_root=Path.cwd().as_uri())
---> 87 manifest_store = parser(
     88     url=filepath,
     89     registry=registry,
     90 )
     92 ds = manifest_store.to_virtual_dataset(
     93     loadable_variables=loadable_variables,
     94     decode_times=decode_times,
     95 )
     96 return ds.drop_vars(list(drop_variables or ()))

File [...]/lib/python3.12/site-packages/virtualizarr/parsers/hdf/hdf.py:168, in HDFParser.__call__(self, url, registry)
    147 def __call__(
    148     self,
    149     url: str,
    150     registry: ObjectStoreRegistry,
    151 ) -> ManifestStore:
    152     """
    153     Parse the metadata and byte offsets from a given HDF5/NetCDF4 file to produce a VirtualiZarr
    154     [ManifestStore][virtualizarr.manifests.ManifestStore].
   (...)    166         A [ManifestStore][virtualizarr.manifests.ManifestStore] which provides a Zarr representation of the parsed file.
    167     """
--> 168     store, path_in_store = registry.resolve(url)
    169     reader = ObstoreReader(store=store, path=path_in_store)
    170     manifest_group = _construct_manifest_group(
    171         filepath=url,
    172         reader=reader,
    173         group=self.group,
    174         drop_variables=self.drop_variables,
    175     )

File [...]/lib/python3.12/site-packages/virtualizarr/registry.py:264, in ObjectStoreRegistry.resolve(self, url)
    262             path_after_prefix = path.lstrip("/")
    263         return store, path_after_prefix
--> 264 raise ValueError(f"Could not find an ObjectStore matching the url `{url}`")

ValueError: Could not find an ObjectStore matching the url `file:///mnt/batch/tasks/shared/LS_root/mounts/clusters/<path-to-local-storage-of-compute-instance>/abfs%3A/<my_azure_storage_container>/<remote_path_to_netcdf_file>`
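The mangled `file:///.../abfs%3A/...` URL hints at what happened: when the scheme is not recognized, the whole string is treated as a relative local path and anchored at the current working directory. A minimal sketch of this failure mode (hypothetical helper, not the actual `validate_and_normalize_path_to_uri` code):

```python
# Simplified, illustrative version of the scheme check described above.
VALID_URI_PREFIXES = {"s3://", "gs://", "http://", "https://", "file:///"}

def normalize(url: str, fs_root: str) -> str:
    # Recognized remote URIs pass through untouched...
    if any(url.startswith(p) for p in VALID_URI_PREFIXES):
        return url
    # ...but anything else is assumed to be a local path and joined onto
    # fs_root, percent-encoding the stray "://" along the way.
    return fs_root.rstrip("/") + "/" + url.replace("://", "%3A/")

normalize("abfs://mycontainer/data/file.nc", "file:///mnt/cwd")
# -> "file:///mnt/cwd/abfs%3A/mycontainer/data/file.nc"
```

The result is exactly the shape of the nonexistent local path in the traceback, which the registry then (correctly) fails to match.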

Any comment or suggestion is much appreciated. Thank you!!

ecamossi avatar Aug 12 '25 16:08 ecamossi

Thanks for raising.

I think the problem is that the scheme 'abfs://' is not in VALID_URI_PREFIXES, so open_virtual_dataset is assuming that it's a local path.

This is obviously bad behaviour, and I think this logic is due for an overhaul - probably the answer is to forbid relative paths entirely so it's never ambiguous whether or not something is a local file, and we can remove the faulty logic entirely. (see https://github.com/zarr-developers/VirtualiZarr/issues/685).
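The "forbid relative paths" idea could look something like this sketch (hypothetical helper, not VirtualiZarr code): reject any URL that lacks an explicit scheme instead of guessing that it is local.

```python
from urllib.parse import urlparse

def require_absolute_uri(url: str) -> str:
    # Demand an explicit scheme so a remote URL can never be silently
    # reinterpreted as a local filesystem path.
    parsed = urlparse(url)
    if not parsed.scheme:
        raise ValueError(f"Expected an absolute URI with an explicit scheme, got {url!r}")
    return url
```

With this rule there is no prefix allow-list to maintain, so a new scheme like `abfs://` cannot fall through the cracks.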

In the meantime, if you want a hacky way around it, you can try altering VALID_URI_PREFIXES. You'll have to let me know if that works though - I'm not set up with an Azure bucket.

TomNicholas avatar Aug 13 '25 23:08 TomNicholas

Thanks a lot for the suggestion! As you suggested, I made a small test patching VALID_URI_PREFIXES, at least to unblock my code:

virtualizarr.manifests.manifest.VALID_URI_PREFIXES = {
    "abfs://", "s3://", "gs://", "azure://", "r2://",
    "cos://", "minio://", "file:///", "http://", "https://",
}

but, strangely enough, I get two different behaviours depending on how I call open_virtual_dataset.

Trying to open a single file (with the same configuration above), using:

vds = open_virtual_dataset(
  url=f_url,
  parser=parser,
  registry=registry,
  loadable_variables=[],
)

now the path to the file is correctly mapped to the right reference in my Azure Storage container: https://<remote_path_to_container>/<remote_path_to_netcdf_file>!

I'm not yet able to open the file because I get an authentication error, but that likely concerns the obstore authentication method. For completeness, I paste the error output here, where you can see that the URL to the data is now mapped correctly:

---------------------------------------------------------------------------
UnauthenticatedError                      Traceback (most recent call last)
Cell In[26], line 2
      1 # open one file 
----> 2 vds = open_virtual_dataset(
      3   url=_url,
      4   parser=parser,
      5   registry=registry,
      6   loadable_variables=[],
      7 )
      8 print(vds)

File [..] lib/python3.12/site-packages/virtualizarr/xarray.py:87, in open_virtual_dataset(url, registry, parser, drop_variables, loadable_variables, decode_times)
     45 """
     46 Open an archival data source as an [xarray.Dataset][] wrapping virtualized zarr arrays.
     47 
   (...)     83     in `loadable_variables` and normal lazily indexed arrays for each variable in `loadable_variables`.
     84 """
     85 filepath = validate_and_normalize_path_to_uri(url, fs_root=Path.cwd().as_uri())
---> 87 manifest_store = parser(
     88     url=filepath,
     89     registry=registry,
     90 )
     92 ds = manifest_store.to_virtual_dataset(
     93     loadable_variables=loadable_variables,
     94     decode_times=decode_times,
     95 )
     96 return ds.drop_vars(list(drop_variables or ()))

File [..] lib/python3.12/site-packages/virtualizarr/parsers/hdf/hdf.py:169, in HDFParser.__call__(self, url, registry)
    152 """
    153 Parse the metadata and byte offsets from a given HDF5/NetCDF4 file to produce a VirtualiZarr
    154 [ManifestStore][virtualizarr.manifests.ManifestStore].
   (...)    166     A [ManifestStore][virtualizarr.manifests.ManifestStore] which provides a Zarr representation of the parsed file.
    167 """
    168 store, path_in_store = registry.resolve(url)
--> 169 reader = ObstoreReader(store=store, path=path_in_store)
    170 manifest_group = _construct_manifest_group(
    171     filepath=url,
    172     reader=reader,
    173     group=self.group,
    174     drop_variables=self.drop_variables,
    175 )
    176 # Convert to a manifest store

File [..] lib/python3.12/site-packages/virtualizarr/utils.py:48, in ObstoreReader.__init__(self, store, path)
     36 def __init__(self, store: ObjectStore, path: str) -> None:
     37     """
     38     Create an obstore file reader that implements the read, readall, seek, and tell methods, which
     39     can be used in libraries that expect file-like objects.
   (...)     46         The path to the file within the store. This should not include the prefix.
     47     """
---> 48     self._reader = obs.open_reader(store, path)

UnauthenticatedError: The operation lacked valid authentication credentials for path <remote_path_to_netcdf_file>: Error performing HEAD https://<remote_path_to_container>/<remote_path_to_netcdf_file> in 140.402647ms - Server returned non-2xx status code: 401 Unauthorized: 

Debug source:
Unauthenticated {
    path: "<remote_path_to_netcdf_file>",
    source: RetryError(
        RetryErrorImpl {
            method: HEAD,
            uri: Some(
                https://<remote_path_to_container>/<remote_path_to_netcdf_file>,
            ),
            retries: 0,
            max_retries: 10,
            elapsed: 140.402647ms,
            retry_timeout: 180s,
            inner: Status {
                status: 401,
                body: Some(
                    "",
                ),
            },
        },
    ),
}

However, strangely enough, the fix doesn't work when open_virtual_dataset is called in a loop. In that case I still get the same error as before (although some of the %-escapes in the URL changed slightly):

virtual_datasets = [
        open_virtual_dataset(
            url = filepath, 
            parser=parser,
            registry=registry, 
            #object_store=store, 
            loadable_variables=[],
        )
        for filepath in blob_fs.glob(f'{pattern}')
]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 2
      1 virtual_datasets = [
----> 2         open_virtual_dataset(
      3             url = filepath, 
      4             parser=parser,
      5             registry=registry, 
      6             #object_store=store, 
      7             loadable_variables=[],
      8         )
      9         for filepath in blob_fs.glob(f'{pattern}')
     10 ]

File [...] lib/python3.12/site-packages/virtualizarr/xarray.py:87, in open_virtual_dataset(url, registry, parser, drop_variables, loadable_variables, decode_times)
     45 """
     46 Open an archival data source as an [xarray.Dataset][] wrapping virtualized zarr arrays.
     47 
   (...)     83     in `loadable_variables` and normal lazily indexed arrays for each variable in `loadable_variables`.
     84 """
     85 filepath = validate_and_normalize_path_to_uri(url, fs_root=Path.cwd().as_uri())
---> 87 manifest_store = parser(
     88     url=filepath,
     89     registry=registry,
     90 )
     92 ds = manifest_store.to_virtual_dataset(
     93     loadable_variables=loadable_variables,
     94     decode_times=decode_times,
     95 )
     96 return ds.drop_vars(list(drop_variables or ()))

File [...] lib/python3.12/site-packages/virtualizarr/parsers/hdf/hdf.py:168, in HDFParser.__call__(self, url, registry)
    147 def __call__(
    148     self,
    149     url: str,
    150     registry: ObjectStoreRegistry,
    151 ) -> ManifestStore:
    152     """
    153     Parse the metadata and byte offsets from a given HDF5/NetCDF4 file to produce a VirtualiZarr
    154     [ManifestStore][virtualizarr.manifests.ManifestStore].
   (...)    166         A [ManifestStore][virtualizarr.manifests.ManifestStore] which provides a Zarr representation of the parsed file.
    167     """
--> 168     store, path_in_store = registry.resolve(url)
    169     reader = ObstoreReader(store=store, path=path_in_store)
    170     manifest_group = _construct_manifest_group(
    171         filepath=url,
    172         reader=reader,
    173         group=self.group,
    174         drop_variables=self.drop_variables,
    175     )

File [...] lib/python3.12/site-packages/virtualizarr/registry.py:264, in ObjectStoreRegistry.resolve(self, url)
    262             path_after_prefix = path.lstrip("/")
    263         return store, path_after_prefix
--> 264 raise ValueError(f"Could not find an ObjectStore matching the url `{url}`")

ValueError: Could not find an ObjectStore matching the url `file:///mnt/batch/tasks/shared/LS_root/mounts/clusters/<path-to-local-storage-of-compute-instance><my_azure_storage_container>/<remote_path_to_netcdf_file>`

Do you have any other suggestions or tests I can run? I would be happy to help solve this issue. Thanks a lot!

ecamossi avatar Aug 20 '25 14:08 ecamossi

Hello again!

I figured out what the problem was: in the loop case, I forgot to prefix the protocol on the URLs to the netCDF files.
I can confirm that your fix works in both cases (one file, multiple files), with virtualizarr 2.1.0 and obstore 0.8.0.
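To make the loop-case mistake concrete: an fsspec-style `glob` returns container-relative paths with no scheme, so the scheme has to be re-added before the URLs reach open_virtual_dataset (the paths below are hypothetical):

```python
# glob-style results carry no scheme...
globbed = ["mycontainer/data/1980.nc", "mycontainer/data/1981.nc"]

# ...so prepend it, otherwise the URLs are treated as local paths.
urls = [f"abfs://{p}" for p in globbed]
```

Without the prefix, each path falls through the scheme check and gets mapped onto the local filesystem, reproducing the original error.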

I also figured out the problem with the authentication to the AzureStore, which is not related to your package but could be helpful for others to know: the AZURE_STORAGE_ACCOUNT_KEY must be passed separately, instead of passing the complete connection string.

I copy below the complete example, for reference.

Thanks a lot for your help!

Best, -Elena


Some imports:

import obstore as obs
import virtualizarr.manifests.manifest  # needed for the VALID_URI_PREFIXES patch below

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers.hdf import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

This is to suppress warnings when opening the NetCDF files, as suggested in the virtualizarr documentation

import warnings
warnings.filterwarnings(
  "ignore",
  message="Numcodecs codecs are not in the Zarr version 3 specification*",
  category=UserWarning
)

Your suggested fix:

virtualizarr.manifests.manifest.VALID_URI_PREFIXES = {
    "abfs://",
    "s3://",
    "gs://",
    "azure://",
    "r2://",
    "cos://",
    "minio://",
    "file:///",
    "http://",
    "https://",
}

Setting the store:

bucket = "abfs://" + AZURE_STORAGE_CONTAINER
datapath = <path-to-data-in-my-container>

store = obs.store.from_url(
    url=f"{bucket}/{datapath}",
    account_name=AZURE_STORAGE_ACCOUNT,
    account_key=AZURE_STORAGE_ACCOUNT_KEY,
    skip_signature=False,
)

registry = ObjectStoreRegistry({f"{bucket}": store})

parser = HDFParser()

Construct the virtual dataset:

pattern = f'<...>/*.nc'  # remote path to the netCDFs

# blob_fs is an adlfs.AzureBlobFileSystem instance (construction omitted)
virtual_datasets = [
    open_virtual_dataset(
        url=f'abfs://{filepath}',
        parser=parser,
        registry=registry,
    )
    for filepath in blob_fs.glob(pattern)
]

and finally, the result of print(virtual_datasets):

[<xarray.Dataset> Size: 339MB
 Dimensions:     (valid_time: 366, latitude: 161, longitude: 1440)
 Coordinates:
   * valid_time  (valid_time) datetime64[ns] 3kB 1980-01-01 ... 1980-12-31
     number      int64 8B ManifestArray<shape=(), dtype=int64, chunks=()>
     latitude    (latitude) float64 1kB ManifestArray<shape=(161,), dtype=floa...
     longitude   (longitude) float64 12kB ManifestArray<shape=(1440,), dtype=f...
 Data variables:
     d2m         (valid_time, latitude, longitude) float32 339MB ManifestArray...
 Attributes:
     GRIB_centre:             ecmf
     GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
     GRIB_subCentre:          0
     Conventions:             CF-1.7
     institution:             European Centre for Medium-Range Weather Forecasts
     history:                 2025-08-06T10:55 GRIB to CDM+CF via cfgrib-0.9.1...,
[...]

ecamossi avatar Aug 21 '25 12:08 ecamossi

I see this issue is closed, but since @ecamossi verified that adding abfs:// in her code works, shouldn't abfs:// be added to VALID_URI_PREFIXES?

rsignell avatar Oct 23 '25 13:10 rsignell

I think you're right @rsignell. In fact, I'm not sure we should have VALID_URI_PREFIXES at all.

TomNicholas avatar Oct 23 '25 17:10 TomNicholas