rioxarray icon indicating copy to clipboard operation
rioxarray copied to clipboard

Segfault when using MemoryFile objects

Open mraspaud opened this issue 1 year ago • 13 comments

Code Sample, a copy-pastable example if possible

from matplotlib import pyplot
import rioxarray
from rasterio.io import MemoryFile

filename = "test.jp2"

def read_file(filename):
    with open(filename, "rb") as of:
        data = of.read()
        memfile = MemoryFile(data)
        src = memfile.open()
        proj = rioxarray.open_rasterio(src, chunks=2048)
        return proj.squeeze("band")

proj = read_file(filename)
proj.plot.imshow()
pyplot.show()

Problem description

The above results in a segmentation fault every time (at least on my laptop)

Expected Output

I expect the display of my data.

Environment Information

rioxarray (0.11.2.dev0) deps: rasterio: 1.2.10 xarray: 2022.3.0 GDAL: 3.4.2 GEOS: None PROJ: None PROJ DATA: None GDAL DATA: None

Other python deps: scipy: 1.8.0 pyproj: 3.3.1

System: python: 3.10.2 | packaged by conda-forge | (main, Feb 1 2022, 19:28:35) [GCC 9.4.0] executable: /home/a001673/mambaforge/envs/satpy/bin/python machine: Linux-4.18.0-372.16.1.el8_6.x86_64-x86_64-with-glibc2.28

Installation method

  • from a fresh git clone, with pip install -e .

mraspaud avatar Jul 18 '22 11:07 mraspaud

I suspect this is due to gdal's memory file object being disposed off before dask can make use of it...

mraspaud avatar Jul 18 '22 11:07 mraspaud

Just for clarification and some context: The opening of the file and reading to put the data in the MemoryFile is of course contrived, and just comes from the fact that I need to be able to pass open files to rioxarray. In my case, I want to use fsspec in the background (https://filesystem-spec.readthedocs.io/en/latest/) so as to be able to read remote files on the fly.

mraspaud avatar Jul 18 '22 11:07 mraspaud

See: https://github.com/rasterio/rasterio/discussions/2517#discussioncomment-3137456 for an example of using MemoryFile using with.

@mraspaud, with this addition to rasterio 1.3 (recently released), I believe you should be able to use fsspec files directly (related #246):

def read_file(filename):
    with open(filename, "rb") as src:
        proj = rioxarray.open_rasterio(src, chunks=2048)
        return proj.squeeze("band")

snowman2 avatar Jul 18 '22 13:07 snowman2

I tried using the context, but that resulted in an empty data array being returned (which I think is quite logical this the context will be closed at the time of plotting).

The example you provide does not seem to work (so passing a regular open file to open_rasterio):

...
  File "/home/a001673/usr/src/rioxarray/rioxarray/_io.py", line 959, in open_rasterio
    result = _prepare_dask(result, riods, filename, chunks)
  File "/home/a001673/usr/src/rioxarray/rioxarray/_io.py", line 664, in _prepare_dask
    mtime = os.path.getmtime(filename)
  File "/home/a001673/mambaforge/envs/satpy/lib/python3.10/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
TypeError: stat: path should be string, bytes, os.PathLike or integer, not BufferedReader

Also rasterio seems to be having problems with opened jp2 files (tif works fine), so I'll be reporting an issue for that on rasterio when I get a minimal case.

mraspaud avatar Jul 18 '22 14:07 mraspaud

But thanks a lot for the heads up on @djhoese 's amazing contribution :)

mraspaud avatar Jul 18 '22 14:07 mraspaud

The example you provide does not seem to work (so passing a regular open file to open_rasterio):

Haven't tested that feature with dask ... looks like we need a patch to handle that.

snowman2 avatar Jul 18 '22 15:07 snowman2

I tried using the context, but that resulted in an empty data array being returned (which I think is quite logical this the context will be closed at the time of plotting).

Sounds like you may need to add a .load() call after open_rasterio.

snowman2 avatar Jul 18 '22 15:07 snowman2

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

djhoese avatar Jul 18 '22 16:07 djhoese

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

Yes exactly.

mraspaud avatar Jul 18 '22 18:07 mraspaud

Edited for brevity, "reproducible" code, and a possible solution:

Regarding @snowman2's suggestion to use with and fsspec, I've had trouble with rioxarray while using this pattern when reading from private Google Cloud Storage buckets. Interestingly it does not seem to happen for xarray.open_dataset(), and when using rioxarray.open_rasterio() it seems like file-like objects are being closed before subsequent operations can go back to retrieve data from them when needed. See below.

Using rioxarray to open geotiffs from a private Google Cloud Storage bucket, the resulting DataArray opens, and I can even access the coordinate values, but accessing the information that would show up in a plot is not possible:

gcp_creds = {'token': '/path/to/token.json'}
fs_args = {'project': 'my-gcs-project'}
storage_options = {**gcp_creds, **fs_args}
protocol= 'gcs'

load_args = {}
fs = fsspec.filesystem(protocol, **storage_options)
load_path = 'my_bucket_name/path/to/file.tif'
with fs.open(load_path) as fs_file:
    da = rioxarray.open_rasterio(fs_file)

Note the file successfully opens but when I try to do da.plot() the following error is generated:

Exception ignored in: 'rasterio._filepath.filepath_read'
Traceback (most recent call last):
  File "/home/jpolly/miniconda3/envs/windninja/lib/python3.9/site-packages/fsspec/spec.py", line 1539, in read
    raise ValueError("I/O operation on closed file.")
ValueError: I/O operation on closed file.
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~0

As a work around to this I've found the following pattern does what I want:

gcp_creds = {'token': '/path/to/token.json'}
fs_args = {'project': 'my-gcs-project'}
storage_options = {**gcp_creds, **fs_args}
protocol= 'gcs'

load_args = {}
fs = fsspec.filesystem(protocol, **storage_options)
load_path = 'my_bucket_name/path/to/file.tif'
da = rioxarray.open_rasterio(fs.open(load_path, **fs_open_args_load), **load_args)

Note that this solution fails when trying to pass any size for 'chunks' as a load argument, e.g., load_args = {'chunks': 2048}.

I can understand how the use of with leads to a file being closed and the ValueError("I/O operation on closed file.") error. It is strange to me that the same pattern works with xarray but not rioxarray, and that chunking does not seem to work.

Is this just something rioxarray currently does not support?

jamespolly avatar Jul 30 '22 14:07 jamespolly

#558 should address this: https://github.com/corteva/rioxarray/issues/550#issuecomment-1187572731

snowman2 avatar Aug 15 '22 21:08 snowman2

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

From what I can tell, calling .load() is equivalent to the initial code snippet above:

    with open(filename, "rb") as of:
        data = of.read()

Since the data is lazily loaded by rioxarray.open_rasterio, calling .load() is the only way to load in the data before closing the file handle in the current state of the code.

Ideas to improve this behavior are definitely welcome. It may be worth looking into how xarray handles this scenario as they may have found a solution for this.

snowman2 avatar Aug 16 '22 14:08 snowman2

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

From what I can tell, calling .load() is equivalent to the initial code snippet above:

    with open(filename, "rb") as of:
        data = of.read()

Yes, it is, however, the of.read() was just a means to get data into the MemoryFile to illustrate the problem.

In the real case, we want to keep things lazy as much as possible. I believe with your example here: https://github.com/corteva/rioxarray/issues/550#issuecomment-1187446210 and the fix in #550 I'm good.

Ideas to improve this behavior are definitely welcome. It may be worth looking into how xarray handles this scenario as they may have found a solution for this.

Yes, I think they do, but I don't remember how.

mraspaud avatar Aug 19 '22 07:08 mraspaud

I think this is resolved. If something is missing still, please comment.

snowman2 avatar Dec 19 '22 21:12 snowman2