rioxarray
Confusion between CachingFileManager lock and rasterio file handle lock
This could easily be my confusion. When opening a Dataset, rioxarray uses the same lock both as the CachingFileManager lock and to protect reads (or writes) on the rasterio file handle.
I think the CachingFileManager lock should be per cache, since that lock protects concurrent access to the cache (cf. file_manager.py). If I open multiple Datasets and pass a different lock to each, then threads could collide while accessing the file manager cache.
On the other hand, we want a per-process, per-file-handle lock for rasterio so that concurrent accesses to different file handles are possible. Note that (again, I think) the current implementation is too restrictive because it reuses the same lock for all sub-datasets of a file. Since each sub-dataset is a distinct rasterio file handle, it should be safe to access them in parallel in the same process.
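To make the distinction concrete, here is a minimal sketch of the two-lock arrangement I have in mind. It uses only the standard library: `HandleCache`, `fake_opener` and `read` are hypothetical stand-ins, not xarray or rioxarray code, and the "handles" are plain dicts rather than rasterio datasets.

```python
import threading


class HandleCache:
    """Toy stand-in for a file-handle cache (hypothetical, not xarray's class)."""

    def __init__(self):
        # One lock per cache: it only guards lookups and inserts.
        self._cache_lock = threading.Lock()
        # key -> (handle, lock that guards that handle)
        self._handles = {}

    def acquire(self, key, opener):
        # The cache lock is held only while the dict is inspected or modified.
        with self._cache_lock:
            if key not in self._handles:
                self._handles[key] = (opener(key), threading.Lock())
            return self._handles[key]


def fake_opener(key):
    # Hypothetical opener; a real one would open a rasterio dataset.
    return {"name": key, "data": f"contents of {key}"}


def read(cache, key):
    handle, handle_lock = cache.acquire(key, fake_opener)
    # The per-handle lock serializes access to this handle only, so reads
    # from other handles (e.g. other sub-datasets) are not blocked.
    with handle_lock:
        return handle["data"]


cache = HandleCache()
names = ["file.hdf:sub_a", "file.hdf:sub_b"]
results = {}


def worker(name):
    results[name] = read(cache, name)


threads = [threading.Thread(target=worker, args=(n,)) for n in names]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point is only that the cache lock and the handle locks have different scopes: one per cache, and one per open handle.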
@ricardog do you have a reproducible example of this issue?
I don't have an example that triggers the race condition on CachingFileManager. I would have to think how to trigger it.
Here's the documentation from CachingFileManager:
lock : duck-compatible threading.Lock, optional
    Lock to use when modifying the cache inside acquire() and close().
    By default, uses a new threading.Lock() object. If set, this object
    should be pickleable.
That makes it seem as if the lock belongs with the caching structure (and not with the file handle being accessed).
If I understand correctly, this is purely theoretical and not something you have run into personally?
Correct, found by code inspection.
Have you tried these examples?
- https://corteva.github.io/rioxarray/stable/examples/dask_read_write.html
- https://corteva.github.io/rioxarray/stable/examples/read-locks.html
Related:
>>> subds = rasterio.open('HDF4_EOS:EOS_GRID:test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf:MODIS_Grid_500m_2D:num_observations_500m')
>>> subds.files
['test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf']
>>> subds.name
'HDF4_EOS:EOS_GRID:test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf:MODIS_Grid_500m_2D:num_observations_500m'
https://github.com/pydata/xarray/blob/main/xarray/backends/file_manager.py#L156-L165
It seems there are reasons to have multiple file handles for the same file, depending on the options used to open it.
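A rough illustration of why that happens, assuming (my reading, not verified against the linked lines) that cache entries are keyed on the opener together with its arguments, mode and keyword options. `make_key` and `fake_open` are hypothetical names used only for this sketch.

```python
def make_key(opener, args, mode, kwargs):
    # Hypothetical key: the same path opened with different options
    # produces a different key, and therefore a separate cached handle.
    return (opener, tuple(args), mode, tuple(sorted(kwargs.items())))


def fake_open(path, mode="r", **kwargs):
    # Stand-in for a real opener such as rasterio.open.
    return {"path": path, "mode": mode, "options": kwargs}


key_plain = make_key(fake_open, ["data.hdf"], "r", {})
key_sharing = make_key(fake_open, ["data.hdf"], "r", {"sharing": False})
key_plain_again = make_key(fake_open, ["data.hdf"], "r", {})

handles = {}
for key in (key_plain, key_sharing, key_plain_again):
    # setdefault opens a handle only the first time a key is seen.
    handles.setdefault(key, fake_open(key[1][0], key[2], **dict(key[3])))
```

Under that assumption, one file on disk ends up with two cached handles here, one per distinct set of open options.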