zarr-python

Unable to append to a very large zarr store

timothyas opened this issue 1 year ago • 4 comments

Zarr version

v2.16.1

Numcodecs version

v0.12.0

Python Version

3.11.6

Operating System

Linux

Installation

using conda

Description

I have a very large (~1 PB) zarr store on Google Cloud Storage here, containing output from a reanalysis (weather model fields like air temperature, wind velocity, etc.). I am using xarray and generally following this guidance to append to the zarr store along the time dimension. The existing dataset has ~30 years of data at 3 hour frequency and 1/4 degree resolution, and I just want to append about two more months, so I am appending a relatively small amount of data to a huge existing zarr store.

I tried this locally with a small example and it worked fine (given below), but I can't seem to do it with this large dataset. I'm executing the initial append step (flagged as problematic below) from a c2-standard-60 node (240 GB RAM) on Google Cloud, but it never completes, and the node eventually becomes unresponsive. Any tips on how to do something like this would be very helpful, and please let me know if I should post this somewhere else. Thanks in advance!

Steps to reproduce

import numpy as np
import xarray as xr

# this is the existing dataset
ds = xr.open_zarr(
    "gcs://noaa-ufs-gefsv13replay/ufs-hr1/0.25-degree/03h-freq/zarr/fv3.zarr",
    storage_options={"token": "anon"},
)

# grab just 2 time stamps of the data and store locally
# this is an example to mimic the existing dataset on GCS
ds[["tmp"]].isel(time=slice(2)).to_zarr("test.zarr")

# now get the next two time stamps and append
# this is the step that never completes for the real thing
xds = ds[["tmp"]].isel(time=slice(2, 4)).load()

# metadata-only append: multiply by NaN to make a placeholder dataset, then
# extend the time dimension without computing/writing the array values
(np.nan * xds).to_zarr("test.zarr", append_dim="time", compute=False)  # <- this is the operation that never completes

# this is what I'll eventually do to actually fill the appended container with values
for i in range(2, 4):
    region = {
        "time": slice(i, i + 1),
        "pfull": slice(None, None),
        "grid_yt": slice(None, None),
        "grid_xt": slice(None, None),
    }
    xds.isel(time=[i - 2]).to_zarr("test.zarr", region=region)
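
As a possible variant of the metadata-only append above (a sketch, not a confirmed fix): build the NaN template lazily from the dask-backed dataset instead of from the loaded copy, so nothing is materialized in memory before to_zarr runs. Whether this avoids the hang at PB scale is untested here.

import numpy as np
import xarray as xr

ds = xr.open_zarr(
    "gcs://noaa-ufs-gefsv13replay/ufs-hr1/0.25-degree/03h-freq/zarr/fv3.zarr",
    storage_options={"token": "anon"},
)

# keep the slice lazy (dask-backed) instead of calling .load()
lazy_slice = ds[["tmp"]].isel(time=slice(2, 4))

# full_like on a lazy dataset stays lazy: the template carries
# shape/dtype/coords but no array data
template = xr.full_like(lazy_slice, np.nan)

# metadata-only append; only zarr metadata and coordinates should be written
template.to_zarr("test.zarr", append_dim="time", compute=False)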

Additional output

No response

timothyas avatar Mar 28 '24 14:03 timothyas

Hi @timothyas - sorry this never got any traction. Did you manage to find a workaround? I noticed this dataset is now available on GCS!

jhamman avatar Oct 18 '24 00:10 jhamman

Hi @jhamman, thanks for reaching out! I haven't had time to dedicate to this, since it hasn't been on our critical path. I had success modifying xarray-beam to append to a subsampled (i.e., smaller) version of the dataset, and I'm hoping we can find a way to scale that workflow up. I'll follow up on this issue with what we find.
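
For anyone unfamiliar, the stock xarray-beam write pattern looks roughly like the sketch below (the append behavior mentioned above required modifying the library and is not shown; chunk sizes are illustrative, and xds is the loaded slice from the reproducer).

import apache_beam as beam
import xarray_beam as xbeam

# lazy template carrying metadata (dims, dtypes, coords) but no data
template = xbeam.make_template(xds)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | xbeam.DatasetToChunks(xds, chunks={"time": 1})  # split into per-timestep chunks
        | xbeam.ChunksToZarr("test.zarr", template=template, zarr_chunks={"time": 1})
    )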

That said, it could be the case that icechunk updates from Earthmover help resolve this? I'm looking forward to the webinar on Tuesday to learn more.

PS Yep, the more public-friendly webpage for the dataset is here, and a gist describing the data available on GCS is here, which is also linked from that other site.

timothyas avatar Oct 18 '24 21:10 timothyas

@timothyas - sounds good. FWIW, I just ran your reproducer above fine on my laptop so I'd say try it again, perhaps this just works now.

> That said, it could be the case that icechunk updates from Earthmover help resolve this? I'm looking forward to the webinar on Tuesday to learn more.

Glad you are going to make the webinar. Icechunk is going to be a great help for workloads like this that need to incrementally update a large dataset but want to do it safely (i.e. with transactions).
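
To illustrate only: a transactional append might look something like the sketch below, reusing ds and xds from the reproducer above. The API is young, so these names (taken from early icechunk docs: local_filesystem_storage, Repository.create, writable_session, commit) may change across versions.

import icechunk

# create a repository backed by local storage (GCS/S3 backends also exist)
storage = icechunk.local_filesystem_storage("/tmp/test-repo")
repo = icechunk.Repository.create(storage)

# write the initial dataset inside a transaction
session = repo.writable_session("main")
ds[["tmp"]].isel(time=slice(2)).to_zarr(session.store, consolidated=False)
session.commit("initial write")

# append along time in a second transaction; nothing becomes visible on
# the "main" branch until the commit succeeds
session = repo.writable_session("main")
xds.to_zarr(session.store, append_dim="time", consolidated=False)
session.commit("append two more time steps")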

jhamman avatar Oct 18 '24 22:10 jhamman

Sorry, the reproducer is unfortunately not ideal. It succeeds for me too on small-to-medium sized datasets, but fails at scale. I'm not sure exactly where the threshold is, but it definitely fails for the ~1 PB dataset I was trying it with. Either way, we'll try it with a larger cluster, and it also sounds like icechunk should help with this task. I'm excited to learn more.

timothyas avatar Oct 18 '24 22:10 timothyas