h5netcdf icon indicating copy to clipboard operation
h5netcdf copied to clipboard

Speedup accesses to h5ds by creating a hard reference to it.

Open hmaarrfk opened this issue 3 years ago • 12 comments

I mostly just want to run the tests.

If we can write a test that edits a dataset in a way that passes before, but fails here, I would gladly move away from this approach. But ultimately, I think it would be greatly beneficial

  • [ ] Closes Issue #xxxx
  • [ ] Tests added
  • [ ] Changes are documented in CHANGELOG.rst

hmaarrfk avatar Dec 02 '22 22:12 hmaarrfk

Sigh. this current branch would break xarray's test suite:

pytest --pyargs xarray -x -k "TestH5NetCDFData and test_refresh_from_disk" -v
Failed test log
_____________________________________ TestH5NetCDFData.test_refresh_from_disk _____________________________________

self = <xarray.tests.test_backends.TestH5NetCDFData object at 0x7f1a4691d370>

    @pytest.mark.skipif(
        ON_WINDOWS, reason="Windows does not allow modifying open files"
    )
    def test_refresh_from_disk(self) -> None:
        # regression test for https://github.com/pydata/xarray/issues/4862
    
        with create_tmp_file() as example_1_path:
            with create_tmp_file() as example_1_modified_path:
    
                with open_example_dataset("example_1.nc") as example_1:
                    self.save(example_1, example_1_path)
    
                    example_1.rh.values += 100
                    self.save(example_1, example_1_modified_path)
    
                a = open_dataset(example_1_path, engine=self.engine).load()
    
                # Simulate external process modifying example_1.nc while this script is running
                shutil.copy(example_1_modified_path, example_1_path)
    
                # Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
                b = open_dataset(example_1_path, engine=self.engine).load()
    
                try:
>                   assert not np.array_equal(a.rh.values, b.rh.values)
E                   AssertionError: assert not True

hmaarrfk avatar Dec 05 '22 13:12 hmaarrfk

@hmaarrfk That's unfortunate, I've hoped that it might just work.

I'm currently traveling, but might have some time to check with h5py how h5netcdf might get an information if something changes at dataset level.

kmuehlbauer avatar Dec 05 '22 14:12 kmuehlbauer

debugging code
import shutil
import xarray
import os 
import h5netcdf

example_1_path = "a.nc"
example_1_modified_path = "b.nc"

example_1 = xarray.open_dataset(
        os.path.join(os.path.dirname(xarray.__file__), "tests", "data", "example_1.nc"),
    )
example_1.to_netcdf(example_1_path, engine='h5netcdf')
example_1.rh.values += 100
example_1.to_netcdf(example_1_modified_path, engine='h5netcdf')
c = h5netcdf.File(example_1_modified_path)
c_rh = c["rh"][...]
assert np.array_equal(example_1.rh.values, c_rh)
c.close()

a = h5netcdf.File(example_1_path, 'r')
a_rh = a["rh"][...]
# a = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# a_rh = a.rh.values

# Simulate external process modifying example_1.nc while this script is running
shutil.copy(example_1_modified_path, example_1_path)

# Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
# b = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# b_rh = b.rh.values
b = h5netcdf.File(example_1_path, 'r')
b_rh = b["rh"][...]

b_netcdf = xarray.open_dataset(example_1_path, engine='netcdf4')
b_netcdf_rh = b_netcdf.rh.values

assert not np.array_equal(a_rh, b_rh)

It seems that the same test with h5py directly passes:

h5py file
import shutil
import xarray
import os 
import h5py

example_1_path = "a.nc"
example_1_modified_path = "b.nc"

example_1 = xarray.open_dataset(
        os.path.join(os.path.dirname(xarray.__file__), "tests", "data", "example_1.nc"),
    )
example_1.to_netcdf(example_1_path, engine='h5netcdf')
example_1.rh.values += 100
example_1.to_netcdf(example_1_modified_path, engine='h5netcdf')
c = h5py.File(example_1_modified_path)
c_rh = c["/rh"][...]
assert np.array_equal(example_1.rh.values, c_rh)
c.close()

a = h5py.File(example_1_path, 'r')
a_rh = a["/rh"][...]
# a = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# a_rh = a.rh.values

# Simulate external process modifying example_1.nc while this script is running
shutil.copy(example_1_modified_path, example_1_path)

# Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
# b = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# b_rh = b.rh.values
b = h5py.File(example_1_path, 'r')
b_rh = b["/rh"][...]

b_netcdf = xarray.open_dataset(example_1_path, engine='netcdf4')
b_netcdf_rh = b_netcdf.rh.values

assert not np.array_equal(a_rh, b_rh)

hmaarrfk avatar Dec 05 '22 14:12 hmaarrfk

Actually, i think it has to do with how h5py loads data. It seems to cache file handles or something:

import shutil
import xarray
import os 
import h5py

example_1_path = "a.nc"
example_1_modified_path = "b.nc"

example_1 = xarray.open_dataset(
        os.path.join(os.path.dirname(xarray.__file__), "tests", "data", "example_1.nc"),
    )
example_1.to_netcdf(example_1_path, engine='h5netcdf')
example_1.rh.values += 100
example_1.to_netcdf(example_1_modified_path, engine='h5netcdf')
c = h5py.File(example_1_modified_path)
c_rh = c["/rh"][...]
assert np.array_equal(example_1.rh.values, c_rh)
c.close()

a = h5py.File(example_1_path, 'r')
a_rh = a["/rh"][...]
# a = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# a_rh = a.rh.values

# Simulate external process modifying example_1.nc while this script is running
shutil.copy(example_1_modified_path, example_1_path)

# Reopen example_1.nc (modified) as `b`; note that `a` has NOT been closed
# b = xarray.open_dataset(example_1_path, engine='h5netcdf').load()
# b_rh = b.rh.values
b = h5py.File(example_1_path, 'r')
b_rh = b["/rh"][...]

b_netcdf = xarray.open_dataset(example_1_path, engine='netcdf4')
b_netcdf_rh = b_netcdf.rh.values

# Compare the old data
assert not np.array_equal(a_rh, b_rh)

# Compare newly loaded data
a_rh = a["/rh"][...]
b_rh = b["/rh"][...]
assert not np.array_equal(a_rh, b_rh)

will fail

hmaarrfk avatar Dec 05 '22 14:12 hmaarrfk

https://github.com/pydata/xarray/discussions/7359

hmaarrfk avatar Dec 05 '22 15:12 hmaarrfk

@hmaarrfk It looks like all h5netcdf tests within the xarray-nightly run get skipped. This hopefully get's fixed in #201.

kmuehlbauer avatar Dec 08 '22 06:12 kmuehlbauer

Trying to close -> reopen to toggle CI.

kmuehlbauer avatar Dec 08 '22 06:12 kmuehlbauer

It looks like all h5netcdf tests within the xarray-nightly run get skipped. This hopefully get's fixed in #201.

This is working now, so we get the failures directly within our CI.

kmuehlbauer avatar Dec 08 '22 07:12 kmuehlbauer

Should we ping: https://github.com/pydata/xarray/discussions/7359

hmaarrfk avatar May 31 '23 15:05 hmaarrfk

I'm not sure what the way forward is.

I do recall that I felt that this was a bug in xarray's understanding of how files work with hdf5.

is there a world where this can be made into an option?

this gives me 150ms speedup on a 500ms load time which is quite significant

hmaarrfk avatar Oct 10 '23 10:10 hmaarrfk

my current theory is that the handle I am creating a hard reference to is stopping the file from getting closed. which in turn is causing a trigger to HDF5's file pointer caching to kick in, stopping the new file from being read.

hmaarrfk avatar Oct 10 '23 12:10 hmaarrfk

my current theory is that the handle I am creating a hard reference to is stopping the file from getting closed. which in turn is causing a trigger to HDF5's file pointer caching to kick in, stopping the new file from being read.

@hmaarrfk Thanks for trying to get behind this. Your reasoning makes sense to me, but I do not have a suggestions on how to move this forward, unfortuantely.

kmuehlbauer avatar Oct 17 '23 07:10 kmuehlbauer