inline_threshold kwarg not changing result of SingleHdf5ToZarr call
I'm trying to create a reference dict with inlined data, but the inline_threshold argument to SingleHdf5ToZarr doesn't seem to be doing anything. I feel like I'm missing something obvious.
In [1]: import xarray as xr
In [2]: ds = xr.tutorial.open_dataset('air_temperature')
In [3]: ds.to_netcdf('air.nc')
/Users/tom/miniconda3/envs/dev3.11/lib/python3.12/site-packages/IPython/core/interactiveshell.py:3548: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs
exec(code_obj, self.user_global_ns, self.user_ns)
In [4]: from kerchunk.hdf import SingleHdf5ToZarr
In [5]: refs = SingleHdf5ToZarr('air.nc', spec=1, inline_threshold=0).translate()
In [6]: refs_inlined = SingleHdf5ToZarr('air.nc', spec=1, inline_threshold=1e9).translate()
In [7]: refs == refs_inlined
Out[7]: True
I was expecting these two dictionaries to be different - refs_inlined to have actual data values in it.
I just ran your example with kerchunk==0.2.2 and got inlining and refs == refs_inlined to be False. @TomNicholas just to check, were you running kerchunk==0.2.4?
Unless this is behavior is expected, it seems like there might be an issue with in-lining introduced in 0.2.3.
Good question @norlandrhagen - I should have reported the version
I used
In [9]: kerchunk.__version__
Out[9]: '0.2.2.post55'
What does that post55 even mean?
No idea! We could try pinning 0.2.2 and running the test suite.
The CI in https://github.com/TomNicholas/VirtualiZarr/actions/runs/8572561510/job/23495399856?pr=73 looks like it created a references dict with no inlining (i.e. the same behaviour that I see locally). And that's with kerchunk=0.2.4
kerchunk<=0.2.2 worked. kerchunk>=0.2.3 failed to create inlining. 🤷
Let's make a reproducer in the form of a test. This rings a bell, something I thought I had fixed.
The code above is a reproducer of the bug.
I would be happy to make a PR, but I would need guidance on how to actually fix the bug.
I just came to this issue and tried to reproduce the original issue on main. It works as expected there, but I did see the behavior that Tom is describing on 2.3.0 and 2.4.0. Seems likely that it was fixed here: https://github.com/fsspec/kerchunk/pull/432 which was just after the 2.4.0 release.
I wrote a test with a little subset of air.nc (ds.sel(time="2014-10", lat=slice(100, 50), lon=slice(250, 300)))