rioxarray
rioxarray copied to clipboard
Segfault when using MemoryFile objects
Code Sample, a copy-pastable example if possible
from matplotlib import pyplot
import rioxarray
from rasterio.io import MemoryFile
filename = "test.jp2"
def read_file(filename):
with open(filename, "rb") as of:
data = of.read()
memfile = MemoryFile(data)
src = memfile.open()
proj = rioxarray.open_rasterio(src, chunks=2048)
return proj.squeeze("band")
proj = read_file(filename)
proj.plot.imshow()
pyplot.show()
Problem description
The above results in a segmentation fault every time (at least on my laptop)
Expected Output
I expect the display of my data.
Environment Information
rioxarray (0.11.2.dev0) deps: rasterio: 1.2.10 xarray: 2022.3.0 GDAL: 3.4.2 GEOS: None PROJ: None PROJ DATA: None GDAL DATA: None
Other python deps: scipy: 1.8.0 pyproj: 3.3.1
System: python: 3.10.2 | packaged by conda-forge | (main, Feb 1 2022, 19:28:35) [GCC 9.4.0] executable: /home/a001673/mambaforge/envs/satpy/bin/python machine: Linux-4.18.0-372.16.1.el8_6.x86_64-x86_64-with-glibc2.28
Installation method
- from a fresh git clone, with
pip install -e .
I suspect this is due to gdal's memory file object being disposed off before dask can make use of it...
Just for clarification and some context: The opening of the file and reading to put the data in the MemoryFile is of course contrived, and just comes from the fact that I need to be able to pass open files to rioxarray. In my case, I want to use fsspec in the background (https://filesystem-spec.readthedocs.io/en/latest/) so as to be able to read remote files on the fly.
See: https://github.com/rasterio/rasterio/discussions/2517#discussioncomment-3137456 for an example of using MemoryFile using with
.
@mraspaud, with this addition to rasterio 1.3 (recently released), I believe you should be able to use fsspec files directly (related #246):
def read_file(filename):
with open(filename, "rb") as src:
proj = rioxarray.open_rasterio(src, chunks=2048)
return proj.squeeze("band")
I tried using the context, but that resulted in an empty data array being returned (which I think is quite logical this the context will be closed at the time of plotting).
The example you provide does not seem to work (so passing a regular open file to open_rasterio
):
...
File "/home/a001673/usr/src/rioxarray/rioxarray/_io.py", line 959, in open_rasterio
result = _prepare_dask(result, riods, filename, chunks)
File "/home/a001673/usr/src/rioxarray/rioxarray/_io.py", line 664, in _prepare_dask
mtime = os.path.getmtime(filename)
File "/home/a001673/mambaforge/envs/satpy/lib/python3.10/genericpath.py", line 55, in getmtime
return os.stat(filename).st_mtime
TypeError: stat: path should be string, bytes, os.PathLike or integer, not BufferedReader
Also rasterio seems to be having problems with opened jp2 files (tif works fine), so I'll be reporting an issue for that on rasterio when I get a minimal case.
But thanks a lot for the heads up on @djhoese 's amazing contribution :)
The example you provide does not seem to work (so passing a regular open file to open_rasterio):
Haven't tested that feature with dask ... looks like we need a patch to handle that.
I tried using the context, but that resulted in an empty data array being returned (which I think is quite logical this the context will be closed at the time of plotting).
Sounds like you may need to add a .load() call after open_rasterio
.
I think using .load()
would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.
I think using
.load()
would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.
Yes exactly.
Edited for brevity, "reproducible" code, and a possible solution:
Regarding @snowman2's suggestion to use with
and fsspec
, I've had trouble with rioxarray while using this pattern when reading from private Google Cloud Storage buckets. Interestingly it does not seem to happen for xarray.open_dataset()
, and when using rioxarray.open_rasterio()
it seems like file-like objects are being closed before subsequent operations can go back to retrieve data from them when needed. See below.
Using rioxarray
to open geotiffs from a private Google Cloud Storage bucket, the resulting DataArray
opens, and I can even access the coordinate values, but accessing the information that would show up in a plot is not possible:
gcp_creds = {'token': '/path/to/token.json'}
fs_args = {'project': 'my-gcs-project'}
storage_options = {**gcp_creds, **fs_args}
protocol= 'gcs'
load_args = {}
fs = fsspec.filesystem(protocol, **storage_options)
load_path = 'my_bucket_name/path/to/file.tif'
with fs.open(load_path) as fs_file:
da = rioxarray.open_rasterio(fs_file)
Note the file successfully opens but when I try to do da.plot()
the following error is generated:
Exception ignored in: 'rasterio._filepath.filepath_read'
Traceback (most recent call last):
File "/home/jpolly/miniconda3/envs/windninja/lib/python3.9/site-packages/fsspec/spec.py", line 1539, in read
raise ValueError("I/O operation on closed file.")
ValueError: I/O operation on closed file.
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~0
As a work around to this I've found the following pattern does what I want:
gcp_creds = {'token': '/path/to/token.json'}
fs_args = {'project': 'my-gcs-project'}
storage_options = {**gcp_creds, **fs_args}
protocol= 'gcs'
load_args = {}
fs = fsspec.filesystem(protocol, **storage_options)
load_path = 'my_bucket_name/path/to/file.tif'
da = rioxarray.open_rasterio(fs.open(load_path, **fs_open_args_load), **load_args)
Note that this solution fails when trying to pass any size for 'chunks'
as a load argument, e.g., load_args = {'chunks': 2048}
.
I can understand how the use of with
leads to a file being closed and the ValueError("I/O operation on closed file.")
error. It is strange to me that the same pattern works with xarray but not rioxarray, and that chunking does not seem to work.
Is this just something rioxarray currently does not support?
#558 should address this: https://github.com/corteva/rioxarray/issues/550#issuecomment-1187572731
I think using
.load()
would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.
From what I can tell, calling .load()
is equivalent to the initial code snippet above:
with open(filename, "rb") as of:
data = of.read()
Since the data is lazily loaded by rioxarray.open_rasterio
, calling .load()
is the only way to load in the data before closing the file handle in the current state of the code.
Ideas to improve this behavior are definitely welcome. It may be worth looking into how xarray handles this scenario as they may have found a solution for this.
I think using
.load()
would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.From what I can tell, calling
.load()
is equivalent to the initial code snippet above:with open(filename, "rb") as of: data = of.read()
Yes, it is, however, the of.read()
was just a means to get data into the MemoryFile
to illustrate the problem.
In the real case, we want to keep things lazy as much as possible. I believe with your example here: https://github.com/corteva/rioxarray/issues/550#issuecomment-1187446210 and the fix in #550 I'm good.
Ideas to improve this behavior are definitely welcome. It may be worth looking into how xarray handles this scenario as they may have found a solution for this.
Yes, I think they do, but I don't remember how.
I think this is resolved. If something is missing still, please comment.