Speedup accesses to h5ds by creating a hard reference to it.
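The idea can be sketched with a toy class (the class and attribute names here are hypothetical, not h5netcdf's actual internals): instead of re-resolving the underlying h5py dataset through the parent group on every access, resolve it once and keep a hard reference to it.

```python
import functools

class Group:
    """Stands in for an open group that resolves children by name."""
    def __init__(self):
        self.lookups = 0
        self._children = {}

    def __getitem__(self, name):
        self.lookups += 1  # each lookup costs a traversal in the real library
        return self._children.setdefault(name, object())

class Variable:
    def __init__(self, parent, name):
        self._parent = parent
        self._name = name

    @property
    def h5ds_uncached(self):
        # old behavior: resolve through the parent on every access
        return self._parent[self._name]

    @functools.cached_property
    def h5ds(self):
        # new behavior: resolved once, then held as a hard reference
        return self._parent[self._name]

g = Group()
v = Variable(g, "rh")
for _ in range(100):
    v.h5ds  # only the first access hits the parent
print(g.lookups)  # 1
```

The trade-off is exactly what the discussion below turns on: the hard reference keeps the resolved object alive for the lifetime of the variable.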
I mostly just want to run the tests.
If we can write a test that edits a dataset in a way that passes before this change but fails with it, I would gladly move away from this approach. But ultimately, I think it would be greatly beneficial.
- [ ] Closes Issue #xxxx
- [ ] Tests added
- [ ] Changes are documented in `CHANGELOG.rst`
Sigh. This current branch breaks xarray's test suite:

```shell
pytest --pyargs xarray -x -k "TestH5NetCDFData and test_refresh_from_disk" -v
```
Failed test log
```
_____________________________________ TestH5NetCDFData.test_refresh_from_disk _____________________________________

self = <xarray.tests.test_backends.TestH5NetCDFData object at 0x7f1a4691d370>

    @pytest.mark.skipif(
        ON_WINDOWS, reason="Windows does not allow modifying open files"
    )
    def test_refresh_from_disk(self) -> None:
        # regression test for https://github.com/pydata/xarray/issues/4862
        with create_tmp_file() as example_1_path:
            with create_tmp_file() as example_1_modified_path:
                with open_example_dataset("example_1.nc") as example_1:
                    self.save(example_1, example_1_path)
                    example_1.rh.values += 100
                    self.save(example_1, example_1_modified_path)

                a = open_dataset(example_1_path, engine=self.engine).load()

                # Simulate external process modifying example_1.nc while this script is running
                shutil.copy(example_1_modified_path, example_1_path)

                # Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
                b = open_dataset(example_1_path, engine=self.engine).load()

                try:
>                   assert not np.array_equal(a.rh.values, b.rh.values)
E                   AssertionError: assert not True
```
@hmaarrfk That's unfortunate, I had hoped that it might just work.
I'm currently traveling, but I might have some time to check with h5py how h5netcdf could get information when something changes at the dataset level.
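One possible direction, purely a sketch and not h5netcdf API: keep the hard reference for speed, but invalidate it when the file on disk changes, detected via the file's inode and modification time. The `CachedOpener` class below is hypothetical.

```python
import os
import pathlib
import tempfile

class CachedOpener:
    """Hypothetical sketch: keep a hard reference to an opened resource,
    but re-open it when the file on disk changes (new inode or mtime)."""

    def __init__(self, path, opener):
        self._path = path
        self._opener = opener
        self._signature = None
        self._cached = None

    def get(self):
        st = os.stat(self._path)
        signature = (st.st_ino, st.st_mtime_ns)
        if self._cached is None or signature != self._signature:
            self._signature = signature
            self._cached = self._opener(self._path)
        return self._cached

# Usage, with a plain text file standing in for a netCDF file:
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.txt")
    pathlib.Path(path).write_text("old")
    cache = CachedOpener(path, lambda p: pathlib.Path(p).read_text())
    first = cache.get()

    # simulate an external process replacing the file (new inode)
    new_path = os.path.join(tmp, "b.txt")
    pathlib.Path(new_path).write_text("new")
    os.replace(new_path, path)

    second = cache.get()
    print(first, second)  # old new
```

Whether this is safe against HDF5's own file-handle caching is exactly the open question in this thread; the stat check only tells us the file changed, not that a fresh open would bypass the cache.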
debugging code
```python
import os
import shutil

import h5netcdf
import numpy as np
import xarray

example_1_path = "a.nc"
example_1_modified_path = "b.nc"

example_1 = xarray.open_dataset(
    os.path.join(os.path.dirname(xarray.__file__), "tests", "data", "example_1.nc"),
)
example_1.to_netcdf(example_1_path, engine='h5netcdf')
example_1.rh.values += 100
example_1.to_netcdf(example_1_modified_path, engine='h5netcdf')

# Sanity check: the modified file contains the modified values
c = h5netcdf.File(example_1_modified_path, 'r')
c_rh = c["rh"][...]
assert np.array_equal(example_1.rh.values, c_rh)
c.close()

a = h5netcdf.File(example_1_path, 'r')
a_rh = a["rh"][...]
# a = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# a_rh = a.rh.values

# Simulate external process modifying example_1.nc while this script is running
shutil.copy(example_1_modified_path, example_1_path)

# Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
# b = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# b_rh = b.rh.values
b = h5netcdf.File(example_1_path, 'r')
b_rh = b["rh"][...]

b_netcdf = xarray.open_dataset(example_1_path, engine='netcdf4')
b_netcdf_rh = b_netcdf.rh.values

assert not np.array_equal(a_rh, b_rh)
```
It seems that the same test with h5py directly passes:
h5py file
```python
import os
import shutil

import h5py
import numpy as np
import xarray

example_1_path = "a.nc"
example_1_modified_path = "b.nc"

example_1 = xarray.open_dataset(
    os.path.join(os.path.dirname(xarray.__file__), "tests", "data", "example_1.nc"),
)
example_1.to_netcdf(example_1_path, engine='h5netcdf')
example_1.rh.values += 100
example_1.to_netcdf(example_1_modified_path, engine='h5netcdf')

# Sanity check: the modified file contains the modified values
c = h5py.File(example_1_modified_path, 'r')
c_rh = c["/rh"][...]
assert np.array_equal(example_1.rh.values, c_rh)
c.close()

a = h5py.File(example_1_path, 'r')
a_rh = a["/rh"][...]
# a = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# a_rh = a.rh.values

# Simulate external process modifying example_1.nc while this script is running
shutil.copy(example_1_modified_path, example_1_path)

# Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
# b = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# b_rh = b.rh.values
b = h5py.File(example_1_path, 'r')
b_rh = b["/rh"][...]

b_netcdf = xarray.open_dataset(example_1_path, engine='netcdf4')
b_netcdf_rh = b_netcdf.rh.values

assert not np.array_equal(a_rh, b_rh)
```
Actually, I think it has to do with how h5py loads data. It seems to cache file handles or something:
```python
import os
import shutil

import h5py
import numpy as np
import xarray

example_1_path = "a.nc"
example_1_modified_path = "b.nc"

example_1 = xarray.open_dataset(
    os.path.join(os.path.dirname(xarray.__file__), "tests", "data", "example_1.nc"),
)
example_1.to_netcdf(example_1_path, engine='h5netcdf')
example_1.rh.values += 100
example_1.to_netcdf(example_1_modified_path, engine='h5netcdf')

# Sanity check: the modified file contains the modified values
c = h5py.File(example_1_modified_path, 'r')
c_rh = c["/rh"][...]
assert np.array_equal(example_1.rh.values, c_rh)
c.close()

a = h5py.File(example_1_path, 'r')
a_rh = a["/rh"][...]
# a = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# a_rh = a.rh.values

# Simulate external process modifying example_1.nc while this script is running
shutil.copy(example_1_modified_path, example_1_path)

# Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
# b = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# b_rh = b.rh.values
b = h5py.File(example_1_path, 'r')
b_rh = b["/rh"][...]

b_netcdf = xarray.open_dataset(example_1_path, engine='netcdf4')
b_netcdf_rh = b_netcdf.rh.values

# Compare the data loaded before the copy
assert not np.array_equal(a_rh, b_rh)

# Compare freshly re-read data from the still-open handles
a_rh = a["/rh"][...]
b_rh = b["/rh"][...]
assert not np.array_equal(a_rh, b_rh)
```

The final assertion fails.
https://github.com/pydata/xarray/discussions/7359
@hmaarrfk It looks like all h5netcdf tests within the xarray-nightly run get skipped. This will hopefully get fixed in #201.
Trying to close -> reopen to toggle CI.
> It looks like all h5netcdf tests within the xarray-nightly run get skipped. This will hopefully get fixed in #201.
This is working now, so we get the failures directly within our CI.
Should we ping https://github.com/pydata/xarray/discussions/7359?
I'm not sure what the way forward is.
I do recall feeling that this was a bug in xarray's understanding of how files work with HDF5.
Is there a world where this can be made into an option? It gives me a 150 ms speedup on a 500 ms load time, which is quite significant.
My current theory is that the handle I am creating a hard reference to is stopping the file from getting closed, which in turn causes HDF5's file pointer caching to kick in and stops the new file from being read.
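The reference-counting part of that theory can be illustrated with plain Python (generic sketch, no h5py specifics): as long as any hard reference to a handle exists, the handle cannot be finalized, so whatever cleanup its destruction triggers never runs.

```python
import gc
import weakref

class Handle:
    """Stands in for a file handle whose cleanup runs on garbage collection."""
    def __init__(self, log):
        self._log = log
    def __del__(self):
        self._log.append("closed")

log = []
h = Handle(log)
ref = weakref.ref(h)

hard = h          # the kind of hard reference this branch introduces
del h
gc.collect()
print(log)        # [] -- still open: `hard` keeps the handle alive

del hard
gc.collect()
print(log)        # ['closed'] -- only now can the handle be released
```

If the real cleanup path closes the HDF5 file, then holding the hard reference would indeed keep the file open and let HDF5's file pointer caching serve stale data, matching the behavior seen in the scripts above.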
@hmaarrfk Thanks for trying to get behind this. Your reasoning makes sense to me, but I do not have a suggestion on how to move this forward, unfortunately.