Shifted/missing chunks after running MultiZarrToZarr
Hey 👋, first of all, thanks for this awesome project! It really makes working with large collections of data so much easier and I greatly appreciate the effort!
Unfortunately, I am currently facing an issue when using kerchunk to create an aggregated view over ERA5 reanalysis data. I have attached a script and two files as a minimal reproducer. I strongly suspect I am using kerchunk incorrectly, but I have had a hard time figuring out what exactly I am doing wrong.
The minimal example contains two netCDF files of daily maximum temperature (tasmax, internally chunked/stored as (31, 2, 2) along the (time, lat, lon) dimensions, respectively), and I am following the example from the Quickstart: I create single-file references using SingleHdf5ToZarr and then merge them using MultiZarrToZarr. However, when I open the merged file, the values for 2003 are correct, while the values for 2004 appear to be shifted or even missing (values after 2004-12-08 are all NaN, even though they are present in the original file).
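As a side note, the on-disk chunking of the source files can be inspected with h5py (a minimal sketch; the exact array shape is illustrative):

import h5py

with h5py.File("era5_tasmax_2003.nc", "r") as f:
    print(f["tasmax"].shape)   # e.g. (365, 2, 2) for (time, lat, lon)
    print(f["tasmax"].chunks)  # (31, 2, 2), the internal chunk shape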
There is one place where I deviate from the example code: I use coo_map={"time": "cf:time"} instead of concat_dims=["time"]. With the latter, my time dimension only ranges from 2003-01-01 to 2004-01-01. I suspect this happens because both source files have time units of days since <year>-01-01, so the raw time values look like range(0, 365) and range(0, 366) for 2003 and 2004 (a leap year), respectively. Therefore, if the axis is not parsed with cftime right away, only the underlying integer values (days since the datum) are assigned, and the merged time dimension ends up with only 366 entries. However, even when I re-encode the time axes to share the same start datum, i.e. both years have units of days since 2003-01-01 00:00:00, so that I can use concat_dims=["time"] directly, my underlying issue persists, so I suppose this is not the problem.
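For completeness, the re-encoding to a common datum can be done along these lines (a sketch; the output file names are made up):

import xarray as xr

for year in (2003, 2004):
    ds = xr.open_dataset(f"era5_tasmax_{year}.nc")
    # Write both years against the same reference date
    ds.time.encoding["units"] = "days since 2003-01-01 00:00:00"
    ds.to_netcdf(f"era5_tasmax_{year}_aligned.nc")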
I would greatly appreciate it if you could point me in a direction for solving this problem.
Code from reproducer:
import xarray as xr
import numpy as np
import ujson
import fsspec

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

if __name__ == "__main__":
    ds2003 = xr.open_dataset("era5_tasmax_2003.nc")
    ds2004 = xr.open_dataset("era5_tasmax_2004.nc")

    fs = fsspec.filesystem("file")
    flist = fs.glob("era5_tasmax_*.nc")

    # Generate single file references
    for fname in flist:
        with fs.open(fname) as infile:
            chunks = SingleHdf5ToZarr(infile, fname)
            outf = f"{fname.split('/')[-1]}.json"
            with open(outf, "wb") as f:
                f.write(ujson.dumps(chunks.translate()).encode())

    json_list = fs.glob("*.json")

    # Merge to one file
    mzz = MultiZarrToZarr(
        json_list,
        coo_map={"time": "cf:time"},
        identical_dims=["lat", "lon"],
        remote_protocol="file",
    )
    with open("era5_tasmax_2003-2004.json", "wb") as f:
        f.write(ujson.dumps(mzz.translate()).encode())

    ds_merged = xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": "era5_tasmax_2003-2004.json"},
        },
    )

    original_sel = ds2004.sel(time="2004-12-08").tasmax.compute()
    merged_sel = ds_merged.sel(time="2004-12-08").tasmax.compute()
    print(original_sel)
    print(merged_sel)
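To pin down exactly which days are affected, rather than eyeballing a single date, the script above could be extended with a quick comparison over all of 2004 (a sketch, assuming both selections yield the full 366 days):

merged_2004 = ds_merged.tasmax.sel(time="2004").values
original_2004 = ds2004.tasmax.values
bad = ~np.isclose(merged_2004, original_2004, equal_nan=True)
# Indices of days whose values differ anywhere on the 2x2 grid
print(np.flatnonzero(bad.any(axis=(1, 2))))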
I am using the following libraries:
# packages in environment at <....>\test_kerchunk:
#
# Name Version Build Channel
asciitree 0.3.3 py_2 conda-forge
blosc 1.21.5 hdccc3a2_0 conda-forge
bzip2 1.0.8 hcfcfb64_5 conda-forge
ca-certificates 2024.2.2 h56e8100_0 conda-forge
cached-property 1.5.2 hd8ed1ab_1 conda-forge
cached_property 1.5.2 pyha770c72_1 conda-forge
certifi 2024.2.2 pyhd8ed1ab_0 conda-forge
cftime 1.6.3 py311h59ca53f_0 conda-forge
entrypoints 0.4 pyhd8ed1ab_0 conda-forge
fasteners 0.17.3 pyhd8ed1ab_0 conda-forge
fsspec 2024.2.0 pyhca7485f_0 conda-forge
h5py 3.10.0 nompi_py311h7195302_101 conda-forge
hdf4 4.2.15 h5557f11_7 conda-forge
hdf5 1.14.3 nompi_h73e8ff5_100 conda-forge
intel-openmp 2024.0.0 h57928b3_49841 conda-forge
kerchunk 0.2.3 pyhd8ed1ab_0 conda-forge
krb5 1.21.2 heb0366b_0 conda-forge
libaec 1.1.2 h63175ca_1 conda-forge
libblas 3.9.0 21_win64_mkl conda-forge
libcblas 3.9.0 21_win64_mkl conda-forge
libcurl 8.5.0 hd5e4a3a_0 conda-forge
libexpat 2.5.0 h63175ca_1 conda-forge
libffi 3.4.2 h8ffe710_5 conda-forge
libhwloc 2.9.3 default_haede6df_1009 conda-forge
libiconv 1.17 hcfcfb64_2 conda-forge
libjpeg-turbo 3.0.0 hcfcfb64_1 conda-forge
liblapack 3.9.0 21_win64_mkl conda-forge
libnetcdf 4.9.2 nompi_h07c049d_113 conda-forge
libsqlite 3.45.1 hcfcfb64_0 conda-forge
libssh2 1.11.0 h7dfc565_0 conda-forge
libxml2 2.12.5 hc3477c8_0 conda-forge
libzip 1.10.1 h1d365fa_3 conda-forge
libzlib 1.2.13 hcfcfb64_5 conda-forge
lz4-c 1.9.4 hcfcfb64_0 conda-forge
mkl 2024.0.0 h66d3029_49657 conda-forge
msgpack-python 1.0.7 py311h005e61a_0 conda-forge
netcdf4 1.6.5 nompi_py311he019f65_100 conda-forge
numcodecs 0.11.0 py311h12c1d0e_1 conda-forge
numpy 1.26.4 py311h0b4df5a_0 conda-forge
openssl 3.2.1 hcfcfb64_0 conda-forge
packaging 23.2 pyhd8ed1ab_0 conda-forge
pandas 2.2.1 py311hf63dbb6_0 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
pthreads-win32 2.9.1 hfa6e2cd_3 conda-forge
python 3.11.8 h2628c8c_0_cpython conda-forge
python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge
python-tzdata 2024.1 pyhd8ed1ab_0 conda-forge
python_abi 3.11 4_cp311 conda-forge
pytz 2024.1 pyhd8ed1ab_0 conda-forge
setuptools 69.1.1 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.10 hfb803bf_0 conda-forge
tbb 2021.11.0 h91493d7_1 conda-forge
tk 8.6.13 h5226925_1 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
ucrt 10.0.22621.0 h57928b3_0 conda-forge
ujson 5.9.0 py311h12c1d0e_0 conda-forge
vc 14.3 hcf57466_18 conda-forge
vc14_runtime 14.38.33130 h82b7239_18 conda-forge
vs2015_runtime 14.38.33130 hcb4865c_18 conda-forge
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
xarray 2024.2.0 pyhd8ed1ab_0 conda-forge
xz 5.2.6 h8d14728_0 conda-forge
zarr 2.17.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 hcfcfb64_5 conda-forge
zstd 1.5.5 h12be248_0 conda-forge
Answering my own question in part: after changing the chunk size along the time dimension to 1, everything suddenly works as expected. My guess is that, since a chunk size of 31 does not evenly divide the 365/366 days of a year, something goes wrong in the generated zarr view such that chunk offsets are calculated incorrectly. I have no real insight into the internals of zarr, kerchunk, or whatever else is involved, but I will leave this open for now, as I am curious what the culprit may be.
Interestingly, a different dataset that has spatial chunking and a time chunk size of 1 works just fine (there, the spatial chunks do divide the dimensions evenly), so I suspect it comes down to chunk sizes not evenly dividing the dimension sizes.
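For reference, the workaround amounts to rewriting the source files with a time chunk size of 1, roughly like this (a sketch; the output file names are made up):

import xarray as xr

for year in (2003, 2004):
    ds = xr.open_dataset(f"era5_tasmax_{year}.nc")
    ds.to_netcdf(
        f"era5_tasmax_{year}_rechunked.nc",
        # netCDF4 chunk sizes along (time, lat, lon); with 1 along time,
        # every chunk is complete regardless of 365 vs. 366 days
        encoding={"tasmax": {"chunksizes": (1, 2, 2)}},
    )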
If I understand correctly, you are hitting a limitation of zarr: all chunks must have the same shape (except the last one along any given axis). This means that if each constituent dataset has an incomplete last chunk, you cannot combine them. ZEP003 is my effort to address this, and I have a proof-of-concept implementation, but there has been no movement yet on getting it into zarr-python.
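A back-of-the-envelope illustration of why 31-day chunks cannot land on a common grid (my own arithmetic, not kerchunk internals):

chunk = 31
# 2003 covers steps 0..364; its last chunk starts at 341 and holds only
# 365 - 341 = 24 steps, i.e. it is incomplete.
# In a combined view with a uniform 31-step grid, 2004's chunks would have
# to start at 365, 396, ..., which never fall on a multiple of 31.
starts_2004 = [365 + s for s in range(0, 366, chunk)]
print([s % chunk for s in starts_2004])  # all 24, never 0: off the chunk grid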
Hey @martindurant, thanks a lot for your input! Now that I think about it, the limitation does make sense to some extent, as this is likely not what zarr was initially designed for.
Is this documented somewhere in the kerchunk docs? If not, I would be happy to add it, since I consider it something of a footgun; if you are in control of creating the source datasets yourself, it would be good to know about.