rioxarray
Confusion between CachingFileManager lock and rasterio file handle lock
This could easily be my confusion. When opening a rioxarray Dataset, the same lock is used both as the CachingFileManager lock and to protect reads (or writes) on the rasterio file handle.
I think the CachingFileManager lock should be per cache, since the lock protects concurrent access to the cache (cf. file_manager.py). If I open multiple Datasets and pass a different lock to each, then threads could collide while accessing the file manager cache.
On the other hand, we want a per-process, per-file-handle lock for rasterio so that concurrent accesses to different file handles are possible. Note that (again, I think) the current implementation is too restrictive because it reuses the same lock for all sub-datasets of a file. Since each sub-dataset is a distinct rasterio file handle, it's safe to access them in parallel in the same process.
@ricardog do you have a reproducible example of this issue?
I don't have an example that triggers the race condition on CachingFileManager. I would have to think about how to trigger it.
Here's the documentation from CachingFileManager:
lock : duck-compatible threading.Lock, optional
Lock to use when modifying the cache inside acquire() and close().
By default, uses a new threading.Lock() object. If set, this object
should be pickleable.
That makes it seem as if the lock and the caching structure go together (and not with the file handle being accessed).
If I understand correctly, this is purely theoretical and hasn't been something you have run into personally?
Correct, found by code inspection.
Have you tried these examples?
- https://corteva.github.io/rioxarray/stable/examples/dask_read_write.html
- https://corteva.github.io/rioxarray/stable/examples/read-locks.html
Related:
>>> subds = rasterio.open('HDF4_EOS:EOS_GRID:test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf:MODIS_Grid_500m_2D:num_observations_500m')
>>> subds.files
['test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf']
>>> subds.name
'HDF4_EOS:EOS_GRID:test/test_data/input/MOD09GA.A2008296.h14v17.006.2015181011753.hdf:MODIS_Grid_500m_2D:num_observations_500m'
https://github.com/pydata/xarray/blob/main/xarray/backends/file_manager.py#L156-L165
It seems there are reasons to have multiple file handles, depending on the options used to open the file.
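As a rough illustration of that point, the sketch below assumes a cache keyed by the path together with the open options, so the same path opened with different options gets a separate handle. The helper names (make_key, get_handle) and the example.tif path are hypothetical, and the key layout is an assumption, not a copy of the linked xarray code.

```python
def make_key(path, mode="r", **open_kwargs):
    """Build a hashable cache key from the path and the open options (hypothetical)."""
    return (path, mode, tuple(sorted(open_kwargs.items())))


handles = {}  # cache key -> stand-in file handle


def get_handle(path, mode="r", **open_kwargs):
    """Return the cached handle for this exact combination of path and options."""
    key = make_key(path, mode, **open_kwargs)
    # object() stands in for a real handle such as the result of rasterio.open(...).
    return handles.setdefault(key, object())


# Same path and options: the cached handle is reused.
same = get_handle("example.tif", sharing=False) is get_handle("example.tif", sharing=False)
# Same path, different options: a second, distinct handle is created.
different = get_handle("example.tif", sharing=False) is not get_handle("example.tif", sharing=True)
```

Under this assumption, a lock tied to the file path alone would cover several distinct handles, which is one reason the lock scope matters.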