Append to an icechunk store at the leaf group level using a virtual DataTree
Problem
I'm very drawn to the xr.DataTree abstraction, so I'm trying to represent the NASA NEX dataset as a DataTree using virtualizarr and store it in an icechunk store whose groups are the datatree's nodes. I do it in two main steps; see the code below for a minimal example. Note that I'm only using the local filesystem to experiment.
- Initialize the store with the right structure, with coordinates at the right level to take advantage of coordinate inheritance.
- Populate the leaves with a virtual dataset. This is the problem, since I cannot update a tree directly.
The second step fails using the great VirtualiZarrDataTreeAccessor.to_icechunk (which probably originates from issue #244), since I don't see any way to append using a group argument (as in VirtualiZarrDatasetAccessor.to_icechunk) or mode="a" (as in xr.DataTree.to_zarr). Basically, I do the following and get the expected ContainsGroupError:
# nex_dt is a datatree with the NEX data directory tree structure
# e.g. the same as session.store, with the `tas` leaves populated with a dataset.
# session.store has the exact same groups structure
nex_dt.vz.to_icechunk(session.store, validate_containers=False)
ContainsGroupError: A group exists in store <icechunk.store.IcechunkStore object at 0x125523ee0> at path ''.
I tried with the following two tree structures, where the leaf groups (tas) are either initialized (empty) or non-existent:
/
├── IPSL-CM6A-LR
│ ├── historical
│ │ ├── r1i1p1f1
│ │ │ └── tas
│ │ └── time (23741,) float64
│ └── ssp585
│ ├── r1i1p1f1
│ │ └── tas
│ └── time (31411,) float64
├── lat (600,) float64
└── lon (1440,) float64
/
├── IPSL-CM6A-LR
│ ├── historical
│ │ ├── r1i1p1f1
│ │ └── time (23741,) float64
│ └── ssp585
│ ├── r1i1p1f1
│ └── time (31411,) float64
├── lat (600,) float64
└── lon (1440,) float64
Note that I'm aware that I could directly create the store with a fully "filled-up" datatree using xr.DataTree.from_dict, but I want to understand what I can conveniently append or not, and I plan to divide the commits when I'm doing it for real.
Attempts
I tried multiple ways:
- The preferred method: VirtualiZarrDataTreeAccessor.to_icechunk, as described above
- Trying to use zarr directly to be able to append with to_zarr
- The working workaround: iterate over the leaves and write the dataset (not the tree) using VirtualiZarrDatasetAccessor.to_icechunk and the group= argument.
First naive approach:
nex_dt.vz.to_icechunk(session.store, validate_containers=False)
ContainsGroupError: A group exists in store <icechunk.store.IcechunkStore object at 0x125523ee0> at path ''.
Second naive approach: here I suppose I might be able to solve the issue by specifying the encoding, but this would likely make re-importing as a virtual datatree difficult (xr.open_datatree or the future vz.open_virtual_datatree, #84).
nex_dt.to_zarr(
    session_troubleshoot.store,
    mode="a",
    zarr_format=3,
    consolidated=False,
    validate_containers=False,
)
ValueError: could not convert string to float: 'AAAAgB2vFUQ='
Working workaround: iterate over the leaves, remove the empty group and write the dataset
for dt in nex_dt.leaves:
    group = dt.path
    ds = dt.dataset
    root = zarr.open_group(store, mode="a")
    print(f"Removing group {group}")
    del root[group]
    print(f"Adding dataset {group}")
    # Restore coordinate inheritance, i.e. don't write the inherited coordinates here
    ds.drop_vars(["lon", "lat", "time"]).vz.to_icechunk(store, group=group, validate_containers=False)
This gives the desired result:
/
├── IPSL-CM6A-LR
│ ├── historical
│ │ ├── r1i1p1f1
│ │ │ └── tas
│ │ │ └── tas (23741, 600, 1440) float32
│ │ └── time (23741,) float64
│ └── ssp585
│ ├── r1i1p1f1
│ │ └── tas
│ │ └── tas (31411, 600, 1440) float32
│ └── time (31411,) float64
├── lat (600,) float64
└── lon (1440,) float64
Question
Having to go back to a per-leaf (per-group) dataset to update the on-disk representation reduces the relevance of the DataTree abstraction. Am I missing something? Is my expectation of a one-liner update of the on-disk representation of a datatree (given the same structure) wrong?
Disclaimer: I'm quite new to this whole xarray + zarr + icechunk + virtualizarr great world, please forgive any heresy.
Versions
- Python 3.13.7
- virtualizarr 2.1.1
- icechunk 1.1.1 (not the latest version at the time of writing)
- macOS Sequoia 15.6
I think we simply have not yet got around to implementing the append_dim or group kwargs on VirtualiZarrDataTreeAccessor.to_icechunk, and you need both for your use case. If you're interested, we would take a PR for either!
In the meantime I think your best option is unfortunately to go back to the leaf/dataset abstraction.