DimensionalData.jl
FR: Support saving a DimArray or a DimTree as HDF5
Working with DD is fantastic, but it can be rather awkward to serialize. I wonder if DD could support saving as HDF5, with limits on what datatypes can be saved? Maybe this is out of scope or too complicated to implement... but it would be very nice to have if possible (maybe YAXArrays serialization can be used as a template).
Probably as Zarr as well. Maybe we can just move Rasters backends here but ignore CF standards and other geo things
Yeah, that sounds like a good idea - maybe we can commit to Xarray standards here? (specifically _ARRAY_DIMENSIONS)
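For reference, the xarray Zarr convention stores each array's dimension names in its attributes under that key; a minimal sketch of what a DD writer would need to emit (the dimension names "X" and "Y" here are just illustrative):
# .zattrs content for a 2-d array under the xarray convention,
# shown as a Julia Dict for illustration
attrs = Dict("_ARRAY_DIMENSIONS" => ["X", "Y"])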
Happy to help build this out if we know what we need to do, this should also make saving arbitrary datatypes easier, so long as they have a Zarr codec/writer.
I've been thinking about this. We need a generalised and extensible writer backend for all AbstractDimArray. So people could add their own package that defines a DD compatible writer. What we have in Rasters.jl at the moment is a start.
You could have a look at YAXArrayBase as inspiration. We should also have an extensible interface to read data into an AbstractDimType
@meggart said we should rather have a writer for AbstractDimDatasets and not write at the DimArray level, because a Dataset better matches the structure of the on-disk data types.
We would also like to have an interface for a skeleton mode, where we would only write the metadata and the dimensions of a Dataset so that we can fill the data later.
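For NetCDF this already maps fairly directly onto NCDatasets.jl, since defVar can define a variable from an element type and dimension names without supplying data; a rough sketch (file and variable names are just illustrative):
using NCDatasets

# "skeleton" write: define dimensions, variables, and metadata only
NCDataset("skeleton.nc", "c") do nc
    defDim(nc, "x", 100)
    defDim(nc, "time", 365)
    v = defVar(nc, "temperature", Float64, ("x", "time"))  # no data written yet
    v.attrib["units"] = "K"
end

# later, reopen and fill the data incrementally, e.g. one time step at a time
NCDataset("skeleton.nc", "a") do nc
    nc["temperature"][:, 1] = rand(100)
end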
Yeah, we need to combine YAX and Rasters readers into a more general DimensionalData/DiskArrays reader/writer interface. I'm not sure where these things live, and if they would all be separate packages.
The skeleton mode is a good idea too. I've been thinking about how to do disk-backed DimTree so you could build a dimensional object on disk iteratively.
We should probably also support some kind of type-stable reading mode for small binary support! E.g: https://github.com/gbaraldi/SimpleHDF5.jl
disk-backed DimTree
more arguments for lazy = :open :)
Ya I will do it soon! Just have a backlog of DD/Rasters PRs and no time
@meggart, @felixcremer and I spoke about this on April 3 on the ESDL / big raster call. Here are some notes from there:
2 levels:
I/O level: extremely simple, only concerned with IO
- Create this array under this key
- Write an array to this key
- Write this metadata to this array
- Read this array from this key
- Read metadata from this array
- Create a group under this group key
- Retrieve a nested dataset under this group key
Take cues from YAXArrayBase here (see the sketch after this list).
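A rough sketch of what that IO layer could look like as plain function stubs (all of these names are hypothetical, not an existing API); each format backend would add methods dispatching on its own backend type:
# Hypothetical IO-layer interface: one generic function per primitive operation.
# Backends (Zarr, HDF5, NetCDF, ...) would add methods for their own type.
abstract type IOBackend end

function create_array end      # create_array(backend, key, eltype, size; kw...)
function write_array! end      # write_array!(backend, key, data)
function write_metadata! end   # write_metadata!(backend, key, metadata::AbstractDict)
function read_array end        # read_array(backend, key)
function read_metadata end     # read_metadata(backend, key)
function create_group! end     # create_group!(backend, groupkey)
function read_group end        # read_group(backend, groupkey) -> nested dataset handle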
Data model level: write using the CF data model, GDAL data model, Zarr data model, etc. (a rough sketch follows the points below)
Then we can write to any format in any data model, so that you can easily swap out e.g. CF conventions for GeoZarr conventions, or, in a Zarr, write geometries as GeoJSON chunks, etc.
- Take cues from Rasters.jl CDM source here.
- This data model layer should be the only thing that calls the IO functions. Then, we can reuse the CommonDataModel.jl ecosystem by having a CDMCF data model implementation that simply calls out to CDM stuff, while still having a CF data model that we can use for NetCDF, and an Xarray data model that we can use below that for HDF and pure Zarr, for example.
- Need to generalise groups/trees and nested data access
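As a hedged sketch of how the data model level could sit on top of those primitives (again, every name here is hypothetical): a data model type decides which arrays and attributes to emit, and only ever calls the IO-layer functions.
using DimensionalData

# re-declared stubs so this sketch stands alone (see the IO-level sketch above)
function write_array! end
function write_metadata! end

# Hypothetical data model layer: the model type chooses the conventions
# (attribute names, encodings); the IO backend chooses the storage format.
abstract type DataModel end
struct XarrayModel <: DataModel end   # plain _ARRAY_DIMENSIONS-style output

function write_model(::XarrayModel, backend, key, A::AbstractDimArray)
    write_array!(backend, key, parent(A))
    write_metadata!(backend, key, Dict(
        "_ARRAY_DIMENSIONS" => [string(DimensionalData.name(d)) for d in dims(A)],
    ))
    return key
end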
EDIT: From discussion with @meggart on 18 Sep 2025 during SDSL
CommonDataModel is basically an "IO layer". The Rasters decomposition to CDM could equivalently be a decomposition to a "hierarchical data model DimTree" that many things can write out - a simple xarray model, CDM, custom formats, etc. We need a good specification of what can and cannot be in the hierarchical data model.
Here are three functions that support DimStack -> NetCDF -> DimStack with preservation of metadata. Happy to open a PR if it's useful.
using DimensionalData, NCDatasets
using Unitful: ustrip          # data is stripped of any units before writing
const DD = DimensionalData

"""
dimstack2netcdf(ds::DimStack, filename) -> String
Save a `DimStack` to a NetCDF file, preserving dimensions, data, and metadata.
# Arguments
- `ds::DimStack`: The DimStack to be saved.
- `filename`: Path to the output NetCDF file.
# Returns
- `String`: The filename of the written NetCDF file.
# Details
- All dimensions are defined in the NetCDF file.
- Dimension variables and their metadata are written.
- All variables in the DimStack are written, including their metadata and optional fill values.
- Global attributes from the DimStack metadata are added to the NetCDF file.
"""
function dimstack2netcdf(ds::DimStack, filename)
NCDataset(filename, "c") do nc
# get dimensions
ds_dims = dims(ds)
# step 1: define dimensions
for dim in ds_dims
dname = string(DimensionalData.name(dim))
defDim(nc, dname, length(dim))
end
# step 2: add dim variables & metadata
for dim in ds_dims
dname = string(DimensionalData.name(dim))
# val(val(dim)) extracts the raw index values from the dimension's lookup
d = defVar(nc, dname, val(val(dim)), (dname,))
# add dimension attributes from dimension metadata
for (k, v) in DD.metadata(dim)
d.attrib[k] = v
end
end
# step 3: add variable
varnames = keys(ds)
for name in varnames
var_meta = DD.metadata(ds[name])
data = ustrip.(parent(ds[name]))  # strip any Unitful units before writing
dim_names = string.(DimensionalData.name.(dims(ds[name])))
if haskey(var_meta, "fillvalue")
v = defVar(nc, string(name), data, dim_names; fillvalue = var_meta["fillvalue"])
else
v = defVar(nc, string(name), data, dim_names)
end
# step 4: add variable attributes from variable metadata
for (k, var0) in var_meta
if k != "fillvalue"
v.attrib[k] = var0
end
end
end
# step 5: add global attributes from global metadata
for (k, v) in DD.metadata(ds)
nc.attrib[string(k)] = v
end
end
return filename
end
"""
netcdf2dimstack(filename) -> DimStack
Load a NetCDF file and convert it to a `DimStack`, preserving dimensions, data, and metadata.
# Arguments
- `filename`: Path to the NetCDF file.
# Returns
- `DimStack`: A DimStack containing all non-dimension variables from the NetCDF file, with associated dimensions and metadata.
"""
function netcdf2dimstack(filename)
# note: the NCDataset is left open so the returned DimStack can read the data lazily
ds0 = NCDataset(filename)
ds = ncdataset2dimstack(ds0)
return ds
end
"""
ncdataset2dimstack(ds0) -> DimStack
Convert an open NCDataset (from NCDatasets.jl) to a DimStack, preserving dimensions, data, and metadata.
# Arguments
- `ds0`: An open NCDataset object.
# Returns
- `DimStack`: A DimStack containing all non-dimension variables from the dataset, with associated dimensions and metadata.
# Notes
- Dimension variables are detected and used to construct Dim objects.
- All global and variable-level attributes are preserved in the resulting DimStack and DimArrays.
"""
function ncdataset2dimstack(ds0)
    # get the dimension names
    dnames = NCDatasets.dimnames(ds0)
    # build a Dim for each dimension, assuming each has a coordinate variable of the same name
    dim0 = Dict()
    for dname in dnames
        dim0[dname] = Dim{Symbol(dname)}(ds0[dname])
    end
    # data variables are everything that is not a dimension variable
    vnames = tuple(setdiff(keys(ds0), dnames)...)
    # wrap each data variable in a DimArray, preserving its dimensions and attributes
    arrays = map(vnames) do vname
        var = ds0[vname]
        vdims = getindex.(Ref(dim0), dimnames(var))
        attribs = Dict(k => var.attrib[k] for k in keys(var.attrib))
        DimArray(var, vdims; metadata = attribs)
    end
    # global attributes become the DimStack metadata
    global_attribs = Dict(k => ds0.attrib[k] for k in keys(ds0.attrib))
    ds = DimStack(NamedTuple{Symbol.(vnames)}(arrays); metadata = global_attribs)
    return ds
end
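A minimal round-trip example using the three functions above (the variable names and metadata are just illustrative). Note that, as written, the reader reconstructs dimensions as Dim{:X}/Dim{:Y} rather than the original X/Y types:
using DimensionalData, NCDatasets, Unitful

# build a small DimStack with variable and global metadata
x, y = X(1:10), Y(1:5)
temp = DimArray(rand(10, 5), (x, y); name=:temperature, metadata=Dict("units" => "K"))
prec = DimArray(rand(10, 5), (x, y); name=:precipitation, metadata=Dict("units" => "mm"))
st = DimStack((temp, prec); metadata=Dict("title" => "round-trip demo"))

# write to NetCDF and read back
fn = dimstack2netcdf(st, tempname() * ".nc")
st2 = netcdf2dimstack(fn)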
I think we need to be very organised about how we do this. We already have two DD packages that load data; we don't want to have three!
Rasters will do this with full CF standards now and load almost anything into DD lookups. We probably want that, and everything YAX does too, all in one place.
At the same time, that shouldn't make this package hugely heavy for the non-spatial cases, so we might need a new package.
But any of this would have to be with @meggart and @felixcremer totally on board to be worth it. The question for years has been who has the time and funding to start and finish this huge project coherently.
If someone can fully fund it I will commit to it but this isn't a spare time tinkering project.
Yes, I agree we should be careful that the solution we hash out for DimensionalData actually replaces the use cases of YAXArrays and Rasters, or enables Rasters to build on top and add only the geo-specific parts that are needed. That's why I really liked Anshul's idea of having a data access interface that handles the plain IO for Zarr, HDF5... and then, on top of that, the specification layer.
When you say fully funded, how much and how long do you expect this to need?
Hard to say, but at least a month full time, maybe 2, to get both the packages using the same backend with a good design.
(Maybe we should organise a call to plan this if it's a real possibility)
I'm happy to help write a numfocus proposal if needed.
@felixcremer @rafaqz have there been any discussions regarding how to secure funding to implement an I/O backend? I'm happy to help where I can. I see having an I/O backend as transformative for the full ecosystem.
No, it's not something I can chase as I have too much work currently. But if someone organised something I could find time to work on it in the next year.
(So yes that would be really appreciated)
Will this be tied in with JLD2 support?
JLD2 is already supported (at least JLD2 supports saving Types)... the issue is that JLD2 can easily break when packages modify types. I use JLD2 pretty heavily for storing DD data but I also expect it to break at some point.
There isn't an easy way to update a JLD2 file for an updated type, is there? Can't you load multiple versions of the same package to convert old file → old data type → new data type → new file?
There isn't an easy way to update a JLD2 file for an updated type
Not that I'm aware of... it's a known limitation of JLD2. A more stable serialization (hdf5, netcdf, zarr) is preferred.
Yeah JLD is a pretty bad data format.
We are allowed to break your files even on minor version bumps, and it's Julia-only. Kind of the worst of all worlds.
So yes, as @alex-s-gardner says, best to focus on Zarr/HDF5/NetCDF etc.
At this point I would go for zarr as the standard.
Zarr is great, but I don't think it is a wholesale replacement for HDF5/NetCDF. HDF5 is self-contained (one file), easy to share, and highly performant for block storage (and object stores, when the metadata is structured correctly)... a lot of DD users will want to save small datasets that get a bit messy when saved locally as Zarr. Don't get me wrong... I love Zarr, but I don't think we can get away from HDF5 support.