
FR: Support saving a DimArray or a DimTree as HDF5

Open alex-s-gardner opened this issue 10 months ago • 23 comments

Working with DD is fantastic, but it can be rather awkward to serialize. I wonder if DD could support saving as HDF5, with limits on what datatypes can be saved? Maybe this is out of scope or too complicated to implement... but it would be very nice to have if possible (maybe YAXArrays serialization can be used as a template).

alex-s-gardner avatar Jan 07 '25 11:01 alex-s-gardner

Probably Zarr as well. Maybe we can just move the Rasters backends here, but ignore CF standards and other geo things.

rafaqz avatar Jan 13 '25 10:01 rafaqz

Yeah, that sounds like a good idea - maybe we can commit to Xarray standards here? (specifically _ARRAY_DIMENSIONS)

Happy to help build this out if we know what we need to do. This should also make saving arbitrary datatypes easier, so long as they have a Zarr codec/writer.
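For reference, the xarray convention is just an `_ARRAY_DIMENSIONS` attribute naming each array's dimensions, with the coordinate values stored as 1D arrays of the same names. A minimal sketch with Zarr.jl (paths and values made up):

```julia
using Zarr

# Each data array records its dimension names in `_ARRAY_DIMENSIONS`;
# the coordinate values live in 1D arrays with those names. In a real
# xarray-readable store these would all sit in one Zarr group.
p = mktempdir()
temp = zcreate(Float64, 360, 180;
    path = joinpath(p, "temperature"),
    chunks = (90, 90),
    attrs = Dict("_ARRAY_DIMENSIONS" => ["lon", "lat"], "units" => "K"))
lon = zcreate(Float64, 360;
    path = joinpath(p, "lon"),
    attrs = Dict("_ARRAY_DIMENSIONS" => ["lon"]))
temp[:, :] = fill(273.15, 360, 180)
lon[:] = collect(range(-179.5, 179.5; length = 360))
```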

asinghvi17 avatar Jan 13 '25 11:01 asinghvi17

I've been thinking about this. We need a generalised and extensible writer backend for all `AbstractDimArray`s, so that people can add their own package that defines a DD-compatible writer. What we have in Rasters.jl at the moment is a start.
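Roughly the kind of extension surface I have in mind, as a purely hypothetical sketch (none of these names exist):

```julia
using DimensionalData

# Hypothetical: a tiny generic interface owned by DD, with no IO dependencies.
abstract type AbstractDimBackend end

# Entry points that DD would own but not implement:
function dimwrite end   # dimwrite(backend, path, A::AbstractDimArray; kw...)
function dimread end    # dimread(backend, path; kw...) -> AbstractDimArray

# A backend package (say, a hypothetical DimensionalDataHDF5.jl) would define
# its backend type and implement the methods against its own IO library:
struct HDF5Backend <: AbstractDimBackend end
# dimwrite(::HDF5Backend, path, A::AbstractDimArray; kw...) = ...
# dimread(::HDF5Backend, path; kw...) = ...
```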

rafaqz avatar Feb 09 '25 12:02 rafaqz

You could have a look at YAXArrayBase for inspiration. We should also have an extensible interface to read data into an AbstractDimType.

@meggart said we should rather have a writer for AbstractDimDatasets and not write at the DimArray level, because a Dataset matches the structure of the on-disk data types better.

We would also like to have an interface for a skeleton mode, where we would only write the metadata and the dimensions of a Dataset so that we can fill the data later.
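Something like these stubs is what I'd imagine for skeleton mode (purely hypothetical, nothing here exists):

```julia
using DimensionalData

"""
    create_skeleton(backend, path, template::AbstractDimStack; kw...)

Hypothetical: write only the dimensions and metadata of `template` to `path`,
allocating but not filling the data regions, and return a writable handle.
"""
function create_skeleton end

"""
    fill_layer!(handle, name::Symbol, data; region...)

Hypothetical: write `data` into the pre-created layer `name`, optionally
restricted to a dimensional `region` (e.g. a single time step).
"""
function fill_layer! end
```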

felixcremer avatar Mar 24 '25 15:03 felixcremer

Yeah, we need to combine YAX and Rasters readers into a more general DimensionalData/DiskArrays reader/writer interface. I'm not sure where these things live, and if they would all be separate packages.

The skeleton mode is a good idea too. I've been thinking about how to do disk-backed DimTree so you could build a dimensional object on disk iteratively.

rafaqz avatar Mar 24 '25 18:03 rafaqz

We should probably also support some kind of type-stable reading mode for small binary support! E.g: https://github.com/gbaraldi/SimpleHDF5.jl

asinghvi17 avatar Mar 27 '25 18:03 asinghvi17

> disk-backed DimTree

more arguments for lazy = :open :)

asinghvi17 avatar Apr 01 '25 13:04 asinghvi17

Ya I will do it soon! Just have a backlog of DD/Rasters PRs and no time

rafaqz avatar Apr 01 '25 14:04 rafaqz

@meggart, @felixcremer and I spoke about this on April 3 on the ESDL / big raster call. Here are some notes from there:

2 levels. I/O level: extremely simple, only concerned with IO:

  • create this array under this key
  • write array to this key
  • write this metadata to this array
  • read this array from this key
  • read metadata from this array
    (take cues from YAXArrayBase here)
  • create a group under this group key
  • retrieve nested dataset under this group key
Data model level: write using the CF data model, GDAL data model, Zarr data model, etc.
Then we can write to any format in any data model, so that you can easily swap out e.g. CF conventions for GeoZarr conventions, or in a Zarr you could write geometries as GeoJSON chunks, etc. (there is a rough sketch of the two layers after the notes below).

  • Take cues from Rasters.jl CDM source here.
  • This data model layer should be the only thing that calls the IO functions. Then, we can reuse the CommonDataModel.jl ecosystem by having a CDMCF data model implementation that simply calls out to CDM stuff, while still having a CF data model that we can use for NetCDF, and an Xarray data model that we can use below that for HDF and pure Zarr, for example.
  • Need to generalise groups/trees and nested data access
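To make the split concrete, here is a purely hypothetical sketch of the two layers (none of these functions exist):

```julia
using DimensionalData

# IO layer: only keys, arrays, attributes and groups. One implementation
# per storage format (Zarr, HDF5, NetCDF-via-CDM, ...).
module IOLayer
function create_array end   # create this array under this key
function write_array! end   # write array data to this key
function write_attrs! end   # write this metadata to this array
function read_array end     # read this array from this key
function read_attrs end     # read metadata from this array
function create_group end   # create a group under this group key
function open_group end     # retrieve a nested dataset under this group key
end

# Data-model layer: CF, GDAL-style, plain xarray, GeoZarr, ... It decides
# *what* gets written and only ever talks to the IO layer.
function write_dataset(model, io, key, st::AbstractDimStack)
    # e.g. a CF model writes coordinate variables, bounds and crs attributes;
    # an xarray model writes `_ARRAY_DIMENSIONS`; both only via IOLayer calls.
end
```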

EDIT: From discussion with @meggart on 18 Sep 2025 during SDSL

CommonDataModel is basically an "IO layer". Rasters' decomposition to CDM can be equivalent to a decomposition to a "hierarchical data model DimTree" that many writers can then serialize: a simple xarray model, CDM, custom formats, etc. We need a good specification of what can and cannot be in the hierarchical data model.

asinghvi17 avatar Apr 09 '25 19:04 asinghvi17

Here are three functions that support DimStack -> NetCDF -> DimStack with preservation of metadata. Happy to open a PR if it's useful.

```julia
# assumed imports for the functions below
using DimensionalData, NCDatasets, Unitful
const DD = DimensionalData

"""
    dimstack2netcdf(ds::DimStack, filename) -> String

Save a `DimStack` to a NetCDF file, preserving dimensions, data, and metadata.

# Arguments
- `ds::DimStack`: The DimStack to be saved.
- `filename`: Path to the output NetCDF file.

# Returns
- `String`: The filename of the written NetCDF file.

# Details
- All dimensions are defined in the NetCDF file.
- Dimension variables and their metadata are written.
- All variables in the DimStack are written, including their metadata and optional fill values.
- Global attributes from the DimStack metadata are added to the NetCDF file.
"""
function dimstack2netcdf(ds::DimStack, filename)

    NCDataset(filename, "c") do nc
        # get dimensions
        ds_dims = dims(ds)

        # step 1: define dimensions
        for dim in ds_dims
            dname = string(DimensionalData.name(dim))
            defDim(nc, dname, length(dim))
        end

        # step 2: add dim variables & metadata
        for dim in ds_dims
            dname = string(DimensionalData.name(dim))
            # val(dim) returns the lookup; val(lookup) returns its index vector
            d = defVar(nc, dname, val(val(dim)), (dname,))

            # add dimension attributes from dimension metadata
            for (k, v) in DD.metadata(dim)
                d.attrib[k] = v
            end
        end

        # step 3: add variables
        varnames = keys(ds)
        for name in varnames
            var_meta = DD.metadata(ds[name])
            var_dims = string.(DimensionalData.name.(dims(ds[name])))
            # strip any Unitful units before writing; unit strings stay in the attributes
            data = ustrip.(parent(ds[name]))

            if haskey(var_meta, "fillvalue")
                v = defVar(nc, string(name), data, var_dims; fillvalue = var_meta["fillvalue"])
            else
                v = defVar(nc, string(name), data, var_dims)
            end

            # step 4: add variable attributes from variable metadata
            for (k, attr) in var_meta
                if k != "fillvalue"
                    v.attrib[k] = attr
                end
            end
        end

        # step 5: add global attributes from global metadata
        for (k, attr) in DD.metadata(ds)
            nc.attrib[string(k)] = attr
        end
    end

    return filename
end


"""
    netcdf2dimstack(filename) -> DimStack

Load a NetCDF file and convert it to a `DimStack`, preserving dimensions, data, and metadata.

# Arguments
- `filename`: Path to the NetCDF file.

# Returns
- `DimStack`: A DimStack containing all non-dimension variables from the NetCDF file, with associated dimensions and metadata.
"""
function netcdf2dimstack(filename)
    ds0 = NCDataset(filename)
    # NB: `ds0` stays open; the returned DimStack wraps its (lazily read)
    # variables, so don't close the dataset while the DimStack is in use.
    ds = ncdataset2dimstack(ds0)
    return ds
end


"""
    ncdataset2dimstack(ds0) -> DimStack

Convert an open NCDataset (from NCDatasets.jl) to a DimStack, preserving dimensions, data, and metadata.

# Arguments
- `ds0`: An open NCDataset object.

# Returns
- `DimStack`: A DimStack containing all non-dimension variables from the dataset, with associated dimensions and metadata.

# Notes
- Dimension variables are detected and used to construct Dim objects.
- All global and variable-level attributes are preserved in the resulting DimStack and DimArrays.
"""
function ncdataset2dimstack(ds0)
    # helper: materialise an attribute list into a plain Dict
    attrib_dict(a) = Dict(k => a[k] for k in keys(a))

    # get the dimension names
    dnames = NCDatasets.dimnames(ds0)

    # build a Dim for each dimension from its coordinate variable
    dim0 = Dict()
    for dname in dnames
        dim0[dname] = Dim{Symbol(dname)}(ds0[dname])
    end

    # every non-dimension variable becomes a layer of the DimStack
    vnames = tuple(setdiff(keys(ds0), dnames)...)
    layers = map(vnames) do vname
        var = ds0[vname]
        vdims = getindex.(Ref(dim0), NCDatasets.dimnames(var))
        DimArray(var, vdims; metadata = attrib_dict(var.attrib))
    end

    ds = DimStack((; zip(Symbol.(vnames), layers)...); metadata = attrib_dict(ds0.attrib))

    return ds
end
```
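Untested, but roughly how I'd expect them to be used (names and values made up):

```julia
using DimensionalData, NCDatasets

# build a small DimStack, write it to NetCDF, and read it back
x, y = X(1:10), Y(1:5)
st = DimStack(
    (a = DimArray(rand(10, 5), (x, y); metadata = Dict("units" => "m")),
     b = DimArray(rand(10, 5), (x, y)));
    metadata = Dict("title" => "roundtrip test"),
)

fn = dimstack2netcdf(st, tempname() * ".nc")
st2 = netcdf2dimstack(fn)
```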

alex-s-gardner avatar Sep 17 '25 00:09 alex-s-gardner

I think we need to be very organised about how we do this. We already have two DD packages that load data; we don't want to have three!

Rasters will do this with full CF standards now and load almost anything to DD lookups. We probably want that, and everything YAX does too, all in one place.

At the same time, that shouldn't make this package hugely heavy for the non-spatial cases. So we might need a new package.

But any of this would have to be with @meggart and @felixcremer totally on board to be worth it. The question for years has been who has the time and funding to start and finish this huge project coherently.

If someone can fully fund it I will commit to it but this isn't a spare time tinkering project.

rafaqz avatar Sep 17 '25 02:09 rafaqz

Yes, I agree we should be careful that the solution we hash out for DimensionalData actually replaces the use cases of YAXArrays and Rasters, or enables Rasters to build on top and add only the geo-specific parts that are needed. That's why I really liked Anshul's idea of a data access interface that handles the plain IO for Zarr, HDF5, ..., with the specification layer on top of that.

When you say fully funded, how much and how long do you expect this to need?

felixcremer avatar Sep 18 '25 09:09 felixcremer

Hard to say, but at least a month full time, maybe 2, to get both the packages using the same backend with a good design.

(Maybe we should organise a call to plan this if it's a real possibility)

rafaqz avatar Sep 21 '25 08:09 rafaqz

I'm happy to help write a NumFOCUS proposal if needed.

alex-s-gardner avatar Sep 22 '25 19:09 alex-s-gardner

@felixcremer @rafaqz have there been any discussions about how to secure funding to implement an I/O backend? I'm happy to help where I can. I see an I/O backend as transformative for the full ecosystem.

alex-s-gardner avatar Oct 14 '25 17:10 alex-s-gardner

No, it's not something I can chase as I have too much work currently. But if someone organised something I could find time to work on it in the next year.

(So yes that would be really appreciated)

rafaqz avatar Oct 14 '25 22:10 rafaqz

Will this be tied in with JLD2 support?

kapple19 avatar Oct 27 '25 23:10 kapple19

> Will this be tied in with JLD2 support?

JLD2 is already supported (at least in the sense that JLD2 can save arbitrary types)... the issue is that JLD2 can easily break when packages modify their types. I use JLD2 pretty heavily for storing DD data, but I also expect it to break at some point.
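To illustrate what "supported" means here (file and variable names made up): JLD2 serializes the whole DimArray struct, internals included, so a round trip works until those internals change.

```julia
using DimensionalData, JLD2

A = DimArray(rand(10, 5), (X(1:10), Y(1:5)); name = :a)

jldsave("a.jld2"; A)        # stores DD's internal structs directly
A2 = load("a.jld2", "A")    # may fail to load after a breaking DD change
```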

alex-s-gardner avatar Oct 27 '25 23:10 alex-s-gardner

There isn't an easy way to update a JLD2 file for an updated type, is there? Couldn't you load multiple versions of the same package to convert old file → old data type → new data type → new file?

kapple19 avatar Oct 28 '25 00:10 kapple19

> There isn't an easy way to update a JLD2 file for an updated type

Not that I'm aware of... it's a known limitation of JLD2. A more stable serialization format (HDF5, NetCDF, Zarr) is preferred.

alex-s-gardner avatar Oct 28 '25 00:10 alex-s-gardner

Yeah, JLD is a pretty bad data format.

We are allowed to break your files even on minor version bumps, and it's Julia-only. Kind of the worst of all worlds.

So yes as @alex-s-gardner says best to focus on zarr/hdf5/netcdf etc

rafaqz avatar Oct 28 '25 10:10 rafaqz

At this point I would go for Zarr as the standard.

lazarusA avatar Oct 28 '25 10:10 lazarusA

Zarr is great, but I don't think it is a wholesale replacement for HDF5/NetCDF. HDF5 is self-contained (one file), easy to share, and highly performant for block storage (and for object storage when the metadata is structured correctly)... a lot of DD users will want to save small datasets that get a bit messy when saved locally as Zarr. Don't get me wrong... I love Zarr, but I don't think we can get away from HDF5 support.

alex-s-gardner avatar Oct 28 '25 16:10 alex-s-gardner