ENH: support reading from in-memory (byte) objects

Open jorisvandenbossche opened this issue 3 years ago • 3 comments

Reading from in-memory objects is not yet supported; only string paths are currently accepted.

With geopandas using fiona you can currently do:

with open(file_path, "rb") as f:
    file_bytes = io.BytesIO(f.read())
geopandas.read_file(file_bytes)

which is implemented using fiona's BytesCollection. In the end, this seems to convert the bytes buffer to a "virtual file" with VSIFileFromMemBuffer: https://github.com/Toblerity/Fiona/blob/a6ed5b2e6972e4dad438b85e6f4b4ac8db1154c6/fiona/ogrext.pyx#L1863-L1879

jorisvandenbossche, Nov 05 '21 12:11

Although we use BytesCollection in the geopandas read_file implementation, fiona also has a MemoryFile, which seems to be a more advanced implementation of this. Without looking further, I don't know whether we would actually need the more complex implementation in pyogrio (for example, I suppose we care less about exposing such a class to the user).

jorisvandenbossche, Nov 05 '21 13:11

From a quick read over fiona's MemoryFile, it looks like part of the complexity comes from having to deal with the requirements of specific drivers (probably more on the writing side). It also supports writing to the MemoryFile then reading out the associated bytes, which could be useful in certain contexts.

For pyogrio, I think we want to continue to avoid stateful classes like MemoryFile. Instead, I think we'd want to approach this as a one-shot operation. For reading, this would depend on detecting an incoming bytes buffer rather than a filename (and also detecting whether it is zipped); we could then get the VSI handle for the buffer, read it, then destroy it.

For writing, perhaps we would pass in a BytesIO buffer for it to write to instead of a file path.

brendan-ward, Nov 10 '21 13:11

Indeed, for reading it's also my conclusion that the simpler version (similar to the BytesCollection in fiona, as referenced in the top post) should be sufficient for our use case.

I am wondering whether it would be possible to connect a Python file-like object to a VSIVirtualHandle (https://gdal.org/api/cpl_cpp.html#classVSIVirtualHandle), so that you could avoid reading the full file-like object into memory as bytes and instead forward the C "read"/"seek" calls to Python read/seek calls. But that's for a different issue (and also not what fiona's MemoryFile enables). EDIT: this seems to be exactly what is being explored in rasterio: https://github.com/rasterio/rasterio/pull/2141
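To illustrate the forwarding idea only, here is a pure-Python sketch, assuming the C side needs nothing more than fread/fseek/ftell-style callbacks. `FileLikeAdapter` is a hypothetical name; actually wiring it to a VSIVirtualHandle would need Cython or C++ glue that is not shown here.

```python
import io
import os


class FileLikeAdapter:
    """Hypothetical adapter exposing C-style read/seek/tell on a Python file object."""

    def __init__(self, fileobj):
        self._f = fileobj

    def read(self, size: int, count: int) -> bytes:
        # Mirrors fread(buffer, size, count): fetch at most size * count bytes.
        return self._f.read(size * count)

    def seek(self, offset: int, whence: int = os.SEEK_SET) -> int:
        # C convention: return 0 on success, -1 on failure.
        try:
            self._f.seek(offset, whence)
        except (OSError, ValueError):
            return -1
        return 0

    def tell(self) -> int:
        return self._f.tell()


# Only the requested bytes are pulled from the underlying object.
adapter = FileLikeAdapter(io.BytesIO(b"hello world"))
adapter.seek(6)
chunk = adapter.read(1, 5)
```

Because each call is forwarded on demand, the underlying object could just as well be an open file or a network stream that supports seeking.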

jorisvandenbossche, Nov 10 '21 13:11

Reading from an in-memory buffer was implemented in https://github.com/geopandas/pyogrio/pull/25, so we can close this. More general reading from a file-like object is covered by #42, and we can open a separate issue for writing to in-memory or file-like objects (-> https://github.com/geopandas/pyogrio/issues/249).

jorisvandenbossche, Apr 30 '23 14:04