
Cache files for different CachingFileManager objects separately

shoyer opened this pull request 4 years ago • 10 comments

This means that explicitly opening a file multiple times with open_dataset (e.g., after modifying it on disk) now reopens the file from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray object. We don't need this for handling many files in Dask (the original motivation for caching), because in those cases only a single CachingFileManager is created.
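
As a minimal sketch of the scenario this affects (illustrative only: the file name is made up, and whether the second open actually picks up the new contents still depends on the backend and on OS-level file locking, as discussed further down the thread):

    import xarray as xr

    # Write an initial file and open it explicitly.
    xr.Dataset({"x": ("t", [1, 2, 3])}).to_netcdf("data.nc")
    first = xr.open_dataset("data.nc")

    # Rewrite the same path with different contents while the first
    # handle is still around.
    xr.Dataset({"x": ("t", [4, 5, 6])}).to_netcdf("data.nc")

    # Previously this second explicit open could return data from a stale,
    # cached file handle; with per-CachingFileManager caches it reopens
    # the file from scratch.
    second = xr.open_dataset("data.nc")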

I think this should fix some long-standing usability issues: #4240, #4862

Conveniently, this also obviates the need for some messy reference counting logic.

  • [x] Closes #4240, #4862
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

shoyer avatar Feb 07 '21 21:02 shoyer

Thank you for the feedback! I quickly tested your suggested fix against the script I referred to in my original issue, and it's still behaving the same, if I'm not mistaken. I looked very quickly, so perhaps I'm wrong, but what I seem to understand is that your fix is similar to an idea my colleague @huard had, which was to make the cached item more granular by adding a call to Path(..).stat() to the cache key tuple (the idea being that if the file has changed on disk between the two open calls, this will detect it). It doesn't work because (I think) it doesn't change the fact that the underlying netCDF file is never explicitly closed, that is, this line is never called:

https://github.com/pydata/xarray/blob/a5f53e203c52a7605d5db799864046471115d04f/xarray/backends/file_manager.py#L222
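
For reference, a rough sketch of that cache-key idea (hypothetical helper, not xarray's actual code): fold the file's stat() result into the key so that a file rewritten on disk no longer maps to the old cache entry. As noted above, this alone still leaves the previously opened handle unclosed.

    from pathlib import Path

    def make_cache_key(opener, path, mode, kwargs):
        # st_mtime_ns and st_size change when the file is rewritten on disk,
        # so a modified file produces a different key and forces a reopen.
        stat = Path(path).stat()
        return (opener, path, mode, tuple(sorted(kwargs.items())),
                stat.st_mtime_ns, stat.st_size)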

Sorry in advance if something in my analysis is wrong, which is very likely!

cjauvin avatar Feb 07 '21 22:02 cjauvin

As my colleague @huard suggested, I have written an additional test which demonstrates the problem (essentially the same idea I proposed in my initial issue):

https://github.com/pydata/xarray/compare/master...cjauvin:add-netcdf-refresh-test

As I explained in the issue I have a potential fix for the problem:

https://github.com/pydata/xarray/compare/master...cjauvin:netcdf-caching-bug

but the problem is that it feels a bit weird to have to do that, so I suspect that there's a better way to solve it.

cjauvin avatar Feb 08 '21 21:02 cjauvin

Thanks for sharing the test case, which I'll add to this PR. I ran it locally and it seems to pass with this PR for the netCDF4 and SciPy netCDF backends (but not h5netcdf). So I'm not entirely sure what to make of that.

shoyer avatar Feb 09 '21 08:02 shoyer

I added @cjauvin's integration test, and verified that the fix works for the scipy and h5netcdf backends.

Unfortunately, it doesn't work yet for the netCDF4 backend. I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library: https://github.com/Unidata/netcdf4-python/issues/1195

shoyer avatar Sep 27 '22 17:09 shoyer

> I don't think we can solve this in Xarray without fixes to netCDF4-Python or the netCDF-C library:

I think we should document this and merge. Though the test failures are real (having trouble cleaning up on Windows when deleting the temp file), and the diff includes some zarr files right now.

dcherian avatar Sep 28 '22 17:09 dcherian

OK, after a bit more futzing tests are passing and I think this is actually ready to go in. I ended up leaving in the reference counting after all -- I couldn't figure out another way to keep files open after a pickle round-trip.
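
For readers following along, a rough sketch of why reference counting helps across a pickle round-trip (a toy FileHandleProxy class, not the PR's actual implementation): each unpickled copy registers another reference to the shared entry, and the real file is only closed when the last copy is closed.

    import threading

    _REF_COUNTS: dict = {}
    _LOCK = threading.Lock()

    class FileHandleProxy:
        """Toy stand-in for a manager whose copies share one open file."""

        def __init__(self, key):
            self._key = key
            with _LOCK:
                _REF_COUNTS[key] = _REF_COUNTS.get(key, 0) + 1

        def close(self):
            with _LOCK:
                _REF_COUNTS[self._key] -= 1
                if _REF_COUNTS[self._key] == 0:
                    del _REF_COUNTS[self._key]
                    # ... this is where the real file handle would be closed ...

        def __getstate__(self):
            return self._key

        def __setstate__(self, key):
            # Unpickling counts as acquiring another reference.
            self.__init__(key)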

shoyer avatar Oct 05 '22 16:10 shoyer

~~Actually maybe we don't need to keep files open after pickling... let me give this one more try.~~

Nevermind, this didn't work -- it still results in failing tests with dask-distributed on Windows.

shoyer avatar Oct 05 '22 17:10 shoyer

Anyone want to review here? I think this should be ready to go in.

shoyer avatar Oct 05 '22 22:10 shoyer

mypy error:

xarray/backends/file_manager.py:277: error: Accessing "__init__" on an instance is unsound, since instance.__init__ could be from an incompatible subclass [misc]

for

    def __setstate__(self, state):
        """Restore from a pickle."""
        opener, args, mode, kwargs, lock = state
        self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)

Seems like we can just ignore?

dcherian avatar Oct 13 '22 17:10 dcherian

I think we could fix this by marking CachingFileManager with typing.final
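
A minimal sketch of that suggestion (simplified constructor signature, not the real one): decorating the class with typing.final tells mypy that self.__init__ cannot come from an incompatible subclass, which is exactly what the [misc] error above is guarding against.

    from typing import final

    @final
    class CachingFileManager:
        def __init__(self, opener, *args, mode=None, kwargs=None, lock=None):
            ...  # simplified stand-in for the real constructor

        def __setstate__(self, state):
            """Restore from a pickle."""
            opener, args, mode, kwargs, lock = state
            self.__init__(opener, *args, mode=mode, kwargs=kwargs, lock=lock)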

shoyer avatar Oct 13 '22 21:10 shoyer