Expose url/filename path and group name/path in standardised functions

Open rafaqz opened this issue 4 months ago • 6 comments

Currently ZarrDatasets.jl doesn't implement CommonDataModel.path or CommonDataModel.groupname, which means Rasters.jl ultimately can't load a lazy Zarr dataset because it doesn't know where to get it again.

But it seems there are a bunch of storage backends with different fields, so we need an abstraction that covers each of those. And it should probably be here rather than in ZarrDatasets.jl?

rafaqz avatar Aug 23 '25 04:08 rafaqz

I am still not sure what this is really about. The whole philosophy of Zarr.jl is to only have lazy handles from the start. This is really unlike GDAL or NetCDF, where you carry around some C pointers. Any Storage object you work with only contains paths to the resource, so there is no need to wrap this in another lazy layer.

meggart avatar Aug 25 '25 07:08 meggart

I think what Rafael is asking for is, given a specific ZArray, can you get the path to that array within the dataset?

asinghvi17 avatar Aug 25 '25 13:08 asinghvi17

Exactly. There is no standard way to get the path and group for a Zarr dataset once you have an object. So ZarrDatasets.jl can't implement the CommonDataModel.jl interface properly or share information with other packages. The actual reason isn't even that relevant: we just need it to finish the interface.

But the specific use case is for @felixcremer. Imagine someone has a Zarr, GRIB, NetCDF or ArchGDAL dataset, and they pass it to Rasters.jl but want it to be lazy, not read all at once. Then Rasters needs to know the group, the path and the variable name to be generically lazy.

I know Zarr doesn't need the wrapper, but NetCDF and GDAL do. And CommonDataModel.jl now provides a standardised wrapper across all backend types, so ZarrDatasets.jl not implementing those methods means we can't use them for anything without special casing.

And I don't want lots of separate code for all of the backends.

We will eventually add another "just keep the file open but don't read the data" kind of lazy loading option in the constructor for Raster that accepts a dataset. That means basically nothing different for Zarr and is what you imagine we should do anyway. But it will keep the NetCDF file open, which will cause problems (so it isn't the recommended generic approach).

rafaqz avatar Aug 25 '25 14:08 rafaqz

Mostly it's about having a consistent set of abstractions across all the potentially CF-compliant data standards.

Sometimes that will be redundant for some backends. It's like how DiskArrays barely makes sense for GRIB but we have it for consistency anyway.

rafaqz avatar Aug 25 '25 14:08 rafaqz

What I think we need is the string that we would call zopen on to get the same ZGroup that we started with. For the top level of a Zarr store this seems to be zgroup.storage.parent.url, but for subgroups it is some combination of this URL and zgroup.path.
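
A rough sketch of that idea, using self-contained mock types rather than Zarr.jl's real ones. All the names here (MockHTTPStore, MockDirectoryStore, MockConsolidatedStore, MockGroup, storepath, grouppath) are hypothetical and only illustrate dispatching on the storage type to hide the differing field names. It is not a proposal for the actual API.

```julia
# Hypothetical stand-ins for storage backends; these are NOT Zarr.jl's real types.
abstract type MockStore end

struct MockHTTPStore <: MockStore
    url::String
end

struct MockDirectoryStore <: MockStore
    folder::String
end

# A wrapper store holding a `parent`, mimicking the storage.parent.url case.
struct MockConsolidatedStore{S<:MockStore} <: MockStore
    parent::S
end

struct MockGroup{S<:MockStore}
    storage::S
    path::String   # path of the group inside the store, "" for the top level
end

# One method per backend hides the differing field names.
storepath(s::MockHTTPStore) = s.url
storepath(s::MockDirectoryStore) = s.folder
storepath(s::MockConsolidatedStore) = storepath(s.parent)

# Combine the store location and the in-store group path into one string.
function grouppath(g::MockGroup)
    root = rstrip(storepath(g.storage), '/')
    sub = strip(g.path, '/')
    return isempty(sub) ? String(root) : string(root, "/", sub)
end

top = MockGroup(MockConsolidatedStore(MockHTTPStore("https://example.org/data.zarr/")), "")
sub = MockGroup(MockDirectoryStore("/tmp/data.zarr"), "group1/nested")

println(grouppath(top))
println(grouppath(sub))
```

Whether the combined string is actually something zopen accepts for every backend is an open question; the sketch only shows how the per-backend lookup could sit behind one function.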

felixcremer avatar Aug 25 '25 16:08 felixcremer

Yes, but CDM puts the path and group name into different function calls (I think?). There is a groupname method at least. Maybe we don't need that for Zarr, but at the very least we need the path/url available through the same function.

rafaqz avatar Aug 25 '25 17:08 rafaqz