rioxarray

Confusion between CachingFileManager lock and rasterio file handle lock

Open ricardog opened this issue 3 years ago • 7 comments

This could easily be confusion on my part. When rioxarray opens a Dataset, it uses the same lock both as the CachingFileManager lock and to protect reads (or writes) on the rasterio file handle.

I think the CachingFileManager lock should be per cache, since the lock protects concurrent access to the cache (cf. file_manager.py). If I open multiple Datasets and pass a different lock to each, then threads could collide while accessing the file manager cache.
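To illustrate the concern, here is a minimal sketch using only the standard library. `lock_a` and `lock_b` are hypothetical stand-ins for the locks passed to two Datasets; nothing here is rioxarray or xarray code. Two distinct locks give no mutual exclusion with respect to each other, so both holders can be inside a critical section around a shared cache at the same time:

```python
import threading

# Hypothetical: each "dataset" was opened with its own lock, but both
# would use that lock to guard the same shared file-manager cache.
lock_a = threading.Lock()
lock_b = threading.Lock()

# "Dataset A" enters the critical section around the cache.
lock_a.acquire()

# "Dataset B" can enter as well, because lock_b is unrelated to lock_a.
b_entered = lock_b.acquire(blocking=False)

# By contrast, a second attempt on the *same* lock is refused.
a_reentered = lock_a.acquire(blocking=False)

print(b_entered, a_reentered)  # True False

lock_b.release()
lock_a.release()
```

Only a single lock shared by every user of the cache would make the second entry wait.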

On the other hand, we want a per-process, per-file-handle lock for rasterio so that concurrent accesses to different file handles are possible. Note that (again, I think) the current implementation is too restrictive because it reuses the same lock for all sub-datasets of a file. Since each sub-dataset is a distinct rasterio file handle, it's safe to access them in parallel in the same process.

ricardog avatar Jun 30 '21 15:06 ricardog

@ricardog do you have a reproducible example of this issue?

snowman2 avatar Jun 30 '21 16:06 snowman2

> @ricardog do you have a reproducible example of this issue?

I don't have an example that triggers the race condition on CachingFileManager. I would have to think how to trigger it.

Here's the documentation from CachingFileManager:

        lock : duck-compatible threading.Lock, optional
            Lock to use when modifying the cache inside acquire() and close().
            By default, uses a new threading.Lock() object. If set, this object
            should be pickleable.

That makes it seem as if the lock and the caching structure go together (and not with the file handle being accessed).

ricardog avatar Jun 30 '21 16:06 ricardog

If I understand correctly, this is purely theoretical and not something you have run into personally?

snowman2 avatar Jun 30 '21 16:06 snowman2

Correct, found by code inspection.

ricardog avatar Jul 01 '21 13:07 ricardog

Have you tried these examples?

  • https://corteva.github.io/rioxarray/stable/examples/dask_read_write.html
  • https://corteva.github.io/rioxarray/stable/examples/read-locks.html

snowman2 avatar Jul 01 '21 16:07 snowman2

Related:

>>> subds = rasterio.open('HDF4_EOS:EOS_GRID:test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf:MODIS_Grid_500m_2D:num_observations_500m')
>>> subds.files
['test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf']
>>> subds.name
'HDF4_EOS:EOS_GRID:test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf:MODIS_Grid_500m_2D:num_observations_500m'

snowman2 avatar Nov 11 '22 16:11 snowman2

https://github.com/pydata/xarray/blob/main/xarray/backends/file_manager.py#L156-L165

It seems there are reasons to have multiple file handles, depending on the options used to open the file.
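A sketch of the idea, under the assumption that a file cache is keyed on how the file was opened (opener, arguments, mode, keyword arguments). `make_key` is a hypothetical helper, not xarray's code, and the path and the `sharing` option are purely illustrative:

```python
def make_key(opener, args, mode, kwargs):
    """Hypothetical cache key built from everything used to open the file."""
    return (opener, tuple(args), mode, tuple(sorted(kwargs.items())))


path = "example.tif"  # hypothetical path, never opened here

key_plain = make_key(open, (path,), "r", {})
key_opts = make_key(open, (path,), "r", {"sharing": False})
key_plain_again = make_key(open, (path,), "r", {})

print(key_plain == key_opts)         # False
print(key_plain == key_plain_again)  # True
```

With a key like this, the same path opened with different options maps to separate cache entries, and therefore to separate file handles.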

snowman2 avatar Nov 11 '22 16:11 snowman2