kerchunk icon indicating copy to clipboard operation
kerchunk copied to clipboard

inline_threshold kwarg not changing result of SingleHdf5ToZarr call

Open TomNicholas opened this issue 1 year ago • 7 comments

I'm trying to create a reference dict with inlined data, but the inline_threshold argument to SingleHdf5ToZarr doesn't seem to be doing anything. I feel like I'm missing something obvious.

In [1]: import xarray as xr

In [2]: ds = xr.tutorial.open_dataset('air_temperature')

In [3]: ds.to_netcdf('air.nc')
/Users/tom/miniconda3/envs/dev3.11/lib/python3.12/site-packages/IPython/core/interactiveshell.py:3548: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs
  exec(code_obj, self.user_global_ns, self.user_ns)

In [4]: from kerchunk.hdf import SingleHdf5ToZarr

In [5]: refs = SingleHdf5ToZarr('air.nc', spec=1, inline_threshold=0).translate()

In [6]: refs_inlined = SingleHdf5ToZarr('air.nc', spec=1, inline_threshold=1e9).translate()

In [7]: refs == refs_inlined
Out[7]: True

I was expecting these two dictionaries to be different - refs_inlined to have actual data values in it.

TomNicholas avatar Apr 05 '24 16:04 TomNicholas

I just ran your example with kerchunk==0.2.2 and got inlining and refs == refs_inlined to be False. @TomNicholas just to check, were you running kerchunk==0.2.4?

Unless this is behavior is expected, it seems like there might be an issue with in-lining introduced in 0.2.3.

norlandrhagen avatar Apr 05 '24 17:04 norlandrhagen

Good question @norlandrhagen - I should have reported the version

I used

In [9]: kerchunk.__version__
Out[9]: '0.2.2.post55'

What does that post55 even mean?

TomNicholas avatar Apr 05 '24 18:04 TomNicholas

No idea! We could try pinning 0.2.2 and running the test suite.

norlandrhagen avatar Apr 05 '24 18:04 norlandrhagen

The CI in https://github.com/TomNicholas/VirtualiZarr/actions/runs/8572561510/job/23495399856?pr=73 looks like it created a references dict with no inlining (i.e. the same behaviour that I see locally). And that's with kerchunk=0.2.4

TomNicholas avatar Apr 05 '24 18:04 TomNicholas

kerchunk<=0.2.2 worked. kerchunk>=0.2.3 failed to create inlining. 🤷

norlandrhagen avatar Apr 05 '24 18:04 norlandrhagen

Let's make a reproducer in the form of a test. This rings a bell, something I thought I had fixed.

martindurant avatar Apr 06 '24 00:04 martindurant

The code above is a reproducer of the bug.

I would be happy to make a PR, but I would need guidance on how to actually fix the bug.

TomNicholas avatar Apr 06 '24 00:04 TomNicholas

I just came to this issue and tried to reproduce the original issue on main. It works as expected there, but I did see the behavior that Tom is describing on 2.3.0 and 2.4.0. Seems likely that it was fixed here: https://github.com/fsspec/kerchunk/pull/432 which was just after the 2.4.0 release.

I wrote a test with a little subset of air.nc (ds.sel(time="2014-10", lat=slice(100, 50), lon=slice(250, 300)))

jsignell avatar May 10 '24 19:05 jsignell