
Use obstore / obspec for globbing in `open_virtual_mfdataset`

Open TomNicholas opened this issue 7 months ago • 4 comments

Problem

`xarray.open_mfdataset` accepts a string with wildcards, and then uses a janky bit of fsspec code to glob with it.

But it's pretty fragile. In particular, it confusingly raises an exception if you try to use glob syntax with an S3 URL without using the xarray zarr backend.

VirtualiZarr currently imports that private internal to do the same kind of globbing, but VirtualiZarr doesn't even have backends in the same way, which is why I attempted to improve the situation upstream (see https://github.com/pydata/xarray/pull/9930).

Solution

However, I realize now that a better way to improve xarray upstream might be to use obstore and obspec instead of fsspec, and to build a robust internal utility in xarray that doesn't raise a random exception for only one xarray backend, and which VirtualiZarr can safely import.

Therefore I think we should:

  1. vendor those internals into VirtualiZarr instead of importing them (soon, because I think globbing remote URLs from `open_virtual_mfdataset` is broken right now because of that exception),
  2. iterate and improve them using obstore and obspec,
  3. eventually push the changes upstream so that xarray no longer needs fsspec for that.
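
To make step 2 a bit more concrete, here is a rough sketch of what a listing-based glob could look like. Everything in it is an assumption for illustration: `SupportsListPaths`, `list_paths`, `glob_paths` and `InMemoryStore` are hypothetical names, not the real obstore or obspec API, and only `*`, `**` and `?` are handled (no `[...]` character classes). The idea is to list everything under the literal prefix of the pattern and then filter the results with a regex.

```python
import re
from typing import Iterable, Protocol


class SupportsListPaths(Protocol):
    """Hypothetical minimal listing interface (NOT the real obspec protocol)."""

    def list_paths(self, prefix: str) -> Iterable[str]: ...


def literal_prefix(pattern: str) -> str:
    """Return the leading whole path segments that contain no wildcard."""
    match = re.search(r"[*?]", pattern)
    head = pattern if match is None else pattern[: match.start()]
    # Cut back to the last "/" so the prefix only covers whole segments.
    return head[: head.rfind("/") + 1]


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simple glob into a regex ("*" and "?" do not cross "/")."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")  # "**" may span several path segments
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # "*" stays within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # "?" is exactly one non-"/" character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def glob_paths(store: SupportsListPaths, pattern: str) -> list:
    """List under the literal prefix, then keep only full matches."""
    regex = glob_to_regex(pattern)
    candidates = store.list_paths(literal_prefix(pattern))
    return sorted(p for p in candidates if regex.fullmatch(p))


class InMemoryStore:
    """Toy store for the example; satisfies SupportsListPaths structurally."""

    def __init__(self, paths):
        self._paths = list(paths)

    def list_paths(self, prefix):
        return [p for p in self._paths if p.startswith(prefix)]


store = InMemoryStore(
    [
        "data/2024/a.nc",
        "data/2024/b.nc",
        "data/2024/notes.txt",
        "data/2025/a.nc",
        "other/a.nc",
    ]
)
matches = glob_paths(store, "data/2024/*.nc")
```

Because the store is only required to match a protocol, the same `glob_paths` would work with any object that can list paths under a prefix, which is the kind of decoupling from fsspec this issue is after.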

cc the usual suspects @maxrjones @sharkinsspatial @kylebarron

EDIT: related to #568 too.

TomNicholas avatar Apr 25 '25 21:04 TomNicholas

This seems great. We could even make a standalone utility for globbing. There are probably enough niche edge cases around how glob characters are interpreted.

Perhaps we should go ahead with a 0.1 release of obspec soon (and leave the question of exceptions for the future)?

kylebarron avatar May 06 '25 19:05 kylebarron

> We could even make a standalone utility for globbing.

fsspec.glob() is very useful.

> (and leave the question of exceptions for the future)

The exceptions thing I mentioned is unrelated to obspec - that's just a quirk in the code Martin added to Xarray.

TomNicholas avatar May 06 '25 19:05 TomNicholas

> > We could even make a standalone utility for globbing.

> fsspec.glob() is very useful.

Perhaps this could go into a library like obspec-utils. That would provide a clear separation between the core library/protocol and extra provided functionality.

> > (and leave the question of exceptions for the future)

> The exceptions thing I mentioned is unrelated to obspec - that's just a quirk in the code Martin added to Xarray.

No, I mean that obspec is currently "blocked" on trying to define core exceptions, because exceptions only support nominal subtyping (subclassing) while obspec only supports structural subtyping (protocols).

But we can release an obspec with undefined exceptions, so people can do anything on top of obspec as long as they don't need a list of "permitted exceptions".
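
A minimal illustration of the nominal vs. structural point, in plain Python. The class names (`SupportsGet`, `DictStore`, `NotFoundLike`, `MyNotFound`) are hypothetical and not taken from obspec. A protocol matches any object with the right shape, but an `except` clause only accepts classes that inherit from `BaseException`, so a protocol cannot be used to describe "permitted exceptions".

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsGet(Protocol):
    """Structural interface: anything with a matching `get` method qualifies."""

    def get(self, path: str) -> bytes: ...


class DictStore:
    """Does not subclass SupportsGet, yet satisfies it structurally."""

    def __init__(self, data):
        self._data = data

    def get(self, path):
        return self._data[path]


# Structural subtyping works without any inheritance relationship.
structural_match = isinstance(DictStore({"a": b"1"}), SupportsGet)


class NotFoundLike(Protocol):
    """Attempt to describe an exception by shape only."""

    path: str


class MyNotFound(Exception):
    def __init__(self, path):
        super().__init__(path)
        self.path = path


def catch_with_protocol():
    try:
        raise MyNotFound("a")
    except NotFoundLike:  # not a BaseException subclass, so this cannot match
        return "caught"


# Python rejects the protocol in the `except` clause with a TypeError.
try:
    outcome = catch_with_protocol()
except TypeError:
    outcome = "TypeError"
```

Here `structural_match` is `True`, while `outcome` ends up as `"TypeError"` rather than `"caught"`: exception matching is strictly by subclass.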

kylebarron avatar May 06 '25 19:05 kylebarron

https://github.com/developmentseed/obspec/issues/12

kylebarron avatar May 06 '25 19:05 kylebarron