VirtualiZarr icon indicating copy to clipboard operation
VirtualiZarr copied to clipboard

Should `open_virtual_dataset` accept relative paths?

Open maxrjones opened this issue 5 months ago • 4 comments

We've made file handling a lot more explicit in the migration towards V2.0, but still accept relative file paths in open_virtual_dataset and open_virtual_mfdataset. For relative paths, we prepend the current working directory via https://github.com/zarr-developers/VirtualiZarr/blob/2e652c3bbe8339698fb2fde390be232596939736/virtualizarr/xarray.py#L47 and https://github.com/zarr-developers/VirtualiZarr/blob/2e652c3bbe8339698fb2fde390be232596939736/virtualizarr/manifests/manifest.py#L54-L85

I think it would be better to require absolute filepaths that specify a scheme via a check such as:

    parsed = urlparse(file_url)
    if not parsed.scheme:
        raise ValueError(
            f"Urls are expected to contain a scheme (e.g., `file://` or `s3://`), received {file_url} which parsed to {parsed}"
        )

Users would need to learn a habit of using a scheme with local files, but it would make the logic of file handling way more obvious.

Any thoughts @zarr-developers/virtualizarr-maintainers?

maxrjones avatar Jul 16 '25 23:07 maxrjones

We're also using an extra private xarray method from xarray.backends.common import _find_absolute_paths to handle relative paths in open_virtual_mfdataset which isn't great.

maxrjones avatar Jul 18 '25 20:07 maxrjones

So I don't think that making this change would make much difference to code portability in practice, because:

  • It only applies to local files (there's no such thing as a relative S3 URL),
  • We can't stop the user writing os.pwd() + "relative/path/to/file.nc" in their own code, which then wouldn't be portable anyway,
  • Code that interacts with local files is inherently not very portable anyway, because it's not portable across machines by definition.

However I do agree that the extra explicitness and ability to remove that xarray internal would be nice. It also means that we don't need to actually do anything with the filepaths whatsoever before they get passed to the parser. The one exception to that is if we support globbing, which I think it would be nice to... I think that has to happen outside the parser though 😕

(Note that we will still need the fs_root argument, because some pre-existing kerchunk files come with only relative paths. It's just that we would get rid of the fs_root=Path.cwd().as_uri() as a default value.)

TomNicholas avatar Jul 18 '25 21:07 TomNicholas

we will still need the fs_root argument, because some pre-existing kerchunk files come with only relative paths.

Actually could this just be a kwarg to the kerchunk parsers?

TomNicholas avatar Aug 13 '25 23:08 TomNicholas

we will still need the fs_root argument, because some pre-existing kerchunk files come with only relative paths.

Actually could this just be a kwarg to the kerchunk parsers?

This issue might be outdated because I think we currently use fs_root as a kwarg to the kerchunk json and parquet parsers.

maxrjones avatar Aug 15 '25 00:08 maxrjones