Parallel reading
I was thinking it would be cool to be able to read files in parallel, certainly in combination with https://github.com/jsignell/dask-geopandas
I haven't looked much at the code of pyogrio yet, but in general, I suppose there are a few options:
- Releasing the GIL in the cython functions where needed/possible, and actually reading in parallel in multiple threads (it seems this should be possible: https://trac.osgeo.org/gdal/wiki/FAQMiscellaneous#IstheGDALlibrarythread-safe)
- Reading in multiple processes might be easier to try directly, and in principle can also integrate with dask-geopandas (though it then probably makes sense to keep using processes for further operations on the GeoDataFrame, while with pygeos we can nowadays actually use multithreading).
Now, for both cases we also need some way to read a part of the dataset, like the "slice" functionality in fiona / now exposed in geopandas. That way, you could first query the total size, and then with some heuristic/options construct appropriate slices to read. Do you already support something like this in pyogrio?
(just putting some quick thoughts here, now that I was thinking about it :))
To read slices, you can use `skip_features` and `max_features`. This should support concurrent reads in multiple processes.
Looking for opportunities to release the GIL is a good idea.
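For illustration, a minimal sketch of that approach, assuming `read_info` reports the total feature count under a `"features"` key (the chunking heuristic here is just an even split):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import pandas as pd
import pyogrio

def read_slice(path, skip, count):
    # Each worker process opens the file independently and reads one slice
    return pyogrio.read_dataframe(path, skip_features=skip, max_features=count)

def read_parallel(path, n_chunks=4):
    total = pyogrio.read_info(path)["features"]
    chunk = -(-total // n_chunks)  # ceiling division
    offsets = list(range(0, total, chunk))
    with ProcessPoolExecutor() as pool:
        parts = pool.map(partial(read_slice, path), offsets, [chunk] * len(offsets))
        return pd.concat(list(parts), ignore_index=True)
```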
Ah, cool, the `skip_features` and `max_features` should indeed provide the required functionality to read parts.
Regarding releasing the GIL, I was taking a look at the code in `_io.pyx`, and some stumbling blocks I noticed:
- Without the GIL, we can't use object dtype arrays. So for storing the WKB, it would need another solution:
  https://github.com/brendan-ward/pyogrio/blob/d7ac3d1e973999802d95a832813deb3bd0d862ee/pyogrio/_io.pyx#L420
  I assume something like storing the WKB in one long C array, and storing the lengths of each entry separately; afterwards this can be converted to an object dtype array (or directly to pygeos geometries), as in the sketch after this list. We did something similar in the old geopandas cython branch.
- The same as the above goes for object arrays for attribute fields (e.g. strings, binary). This of course makes it quite a bit more complex...
- To keep track of the variable (meaning: not known at compile time) number of fields of a file, you are now using a Python list of numpy arrays:
  https://github.com/brendan-ward/pyogrio/blob/d7ac3d1e973999802d95a832813deb3bd0d862ee/pyogrio/_io.pyx#L425-L428
  which is a nice solution, but also something that won't directly work in a `nogil` block, I think (at least the Python lists won't work; one could maybe use C arrays / a C++ vector).
So maybe not that straightforward.
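For illustration, a minimal Python-level sketch of that flat-buffer idea (the real code would fill `buffer` and `lengths` in a `nogil` Cython loop; `shapely.from_wkb` stands in here for pygeos):

```python
import numpy as np
import shapely

def wkb_buffer_to_geometries(buffer: bytes, lengths: np.ndarray):
    # Recover the start offset of each WKB blob from the lengths
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    # Slice the flat buffer back into one WKB blob per feature;
    # this loop needs the GIL, but runs after the parallel read
    wkb = np.empty(len(lengths), dtype=object)
    for i in range(len(lengths)):
        wkb[i] = buffer[offsets[i]:offsets[i + 1]]
    # Parse all blobs into geometries in one vectorized call
    return shapely.from_wkb(wkb)
```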
Thanks for looking into this further and offering suggestions!
What do you think about using the Arrow C library and Arrow data structures as the memory transport between OGR and Python? I'm not familiar enough with that to know if it is appropriate for what we're doing here or if that would increase implementation complexity substantially. I was thinking over the long term, this would allow us to leverage data structures from geo-arrow-spec for geometries instead of WKB.
Actually, it looks like we'd need pyarrow (w/ Cython bindings) to do what we want to do here.
Basic idea would be to go from OGR records (field values + geometries) => Arrow table => pandas data frame => decode geometry => GeoDataFrame.
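A minimal sketch of the downstream half of that pipeline (the OGR records => Arrow table step is the open part; the `"wkb_geometry"` column name is just an assumption):

```python
import geopandas as gpd
import pyarrow as pa
import shapely

def arrow_to_geodataframe(table: pa.Table) -> gpd.GeoDataFrame:
    # Arrow table -> pandas data frame (attribute columns + raw WKB)
    df = table.to_pandas()
    # Decode the WKB column into geometries and wrap the result
    geometry = shapely.from_wkb(df.pop("wkb_geometry").values)
    return gpd.GeoDataFrame(df, geometry=geometry)
```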
In theory, that would be a good fit, I think (I have thought before about a GDAL-Arrow bridge, which would be really nice).
In practice, though, that will also add quite some complexity, I think. Certainly packaging-wise: both GDAL and Arrow are heavy and complex dependencies, and then we'd be building against both of them... (for conda that will be OK, but I have no idea how to even start making a wheel for this; it would be good to see if there are other packages that depend on Arrow as a build dependency and make wheels, e.g. turbodbc, but they don't have wheels yet).
I am not sure the pyarrow Cython bindings currently expose sufficient functionality. Arrow has a concept of "builders" to iteratively build up arrays (which would fit very well with what we do here: get record by record and build up the arrays), but not all of those are currently exposed in Cython (there is an open issue about adding those). The alternative is to use the C++ API. An example of this is the above-mentioned turbodbc: https://github.com/blue-yonder/turbodbc/blob/master/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp (converting ODBC result sets into Arrow arrays).
Thinking about it a bit more: an interesting avenue would be to use Arrow's C data interface. Ideally, this could be a feature living in GDAL, without them needing to add Arrow as a dependency. GDAL would construct the data in Arrow's format and expose it using that interface. Then another library can fetch those data and use pyarrow to put them in an Arrow Table (also without a build dependency on Arrow, only a runtime dependency). I think the main disadvantage of this (apart from convincing the GDAL devs to support it) is that you then can't use all the helpers from the Arrow C++ library to build up those arrays (the Builder classes, which can deal with missing values etc. for you). For primitive types like floats and ints without nulls this should still be relatively straightforward, though, since those are plain buffers like numpy / C arrays.
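The interface itself is just a pair of C structs, so the handoff can be sketched with pyarrow playing both roles (producer, the role GDAL would take in this proposal, and consumer), using the `pyarrow.cffi` helpers and pyarrow's private `_export_to_c` / `_import_from_c` methods:

```python
import pyarrow as pa
from pyarrow.cffi import ffi

# Producer side: allocate the two interface structs and export into them
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
ptr_schema = int(ffi.cast("uintptr_t", c_schema))
ptr_array = int(ffi.cast("uintptr_t", c_array))

arr = pa.array([1.0, 2.0, 3.0])
arr._export_to_c(ptr_array, ptr_schema)

# Consumer side: rebuild the array from the raw struct pointers alone,
# without any compile-time dependency on the producer's Arrow build
arr2 = pa.Array._import_from_c(ptr_array, ptr_schema)
assert arr2.equals(arr)
```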
I was just reading the code for `nogil` blocks and saw this issue 🙂. Three years changes a lot!
The OGR docs do say:

> only one ArrowArrayStream can be active at a time on a given layer

Does that mean you could create two separate layer objects to refer to the same file, and then have two `ArrowArrayStream`s?
More generally, is there anywhere it makes sense to add `nogil` in the `ogr_read_arrow` code? I'm not entirely clear where the computation happens between Arrow fetching each batch and OGR computing each batch.
Quoting myself ("Ideally, this could be a feature living in GDAL, without them needing to add Arrow as a dependency"): that has aged well 🙂
> Does that mean you could create two separate layer objects to refer to the same file, and then have two `ArrowArrayStream`s?
I think so, yes. That's also what we already do in dask-geopandas' `read_file`, but in multiple processes instead of threads, to avoid the GIL issue. We should test changing that to use threads with the Arrow reader.
Now, while this in theory makes parallel reading possible, in practice it still depends on the exact file format whether it supports this efficiently (i.e. whether it supports random access and setting the feature iterator by index, `OLCFastSetNextByIndex`). For example, if you did this with GeoJSON, reading it in 4 chunks from 4 threads, each reader would still iterate from the beginning of the file to get to its part anyway.
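Whether a given driver advertises that capability can be checked through GDAL's own Python bindings, e.g. (file name hypothetical):

```python
from osgeo import ogr

ds = ogr.Open("example.gpkg")
layer = ds.GetLayer(0)
# 1 if the driver can jump to an arbitrary feature index cheaply,
# 0 if SetNextByIndex has to scan from the start of the file
print(layer.TestCapability(ogr.OLCFastSetNextByIndex))
```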
> More generally, is there anywhere it makes sense to add `nogil` in the `ogr_read_arrow` code? I'm not entirely clear where the computation happens between Arrow fetching each batch and OGR computing each batch.
I think the nogil is already taken care of by pyarrow. `RecordBatchReader.read_next_batch` has a `nogil` context around the call into the C++ `ReadNext`, which will call the `next()` method of the C stream. And I think the actual work by GDAL is being done at that point.
> I think the nogil is already taken care of by pyarrow. `RecordBatchReader.read_next_batch` has a `nogil` context around the call into the C++ `ReadNext`, which will call the `next()` method of the C stream. And I think the actual work by GDAL is being done at that point.
Ah, that makes sense. I did see the `nogil` here in pyarrow. It makes sense that the original `OGR_L_GetArrowStream` doesn't really do any work itself; the work is contained within each `next()` call of the C stream.
FYI, not 100% the same as what is being discussed here, but somewhat related: for e.g. GPKG files, GDAL already reads multithreaded in some circumstances:

> OGR_GPKG_NUM_THREADS=value: (GDAL >= 3.8.3) Can be set to an integer or ALL_CPUS. This is the number of threads used when reading tables through the ArrowArray interface, when no filter is applied and when features have consecutive feature ID numbering. The default is the minimum of 4 and the number of CPUs. Note that setting this value too high is not recommended: a value of 4 is close to the optimal.
Source: https://gdal.org/drivers/vector/gpkg.html#configuration-options
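For example, a minimal sketch of enabling this from pyogrio, assuming `set_gdal_config_options` passes the option through to GDAL (file name hypothetical):

```python
import pyogrio

# Let GDAL use up to 4 threads for GPKG reads via the ArrowArray
# interface (GDAL >= 3.8.3); the docs advise against going higher
pyogrio.set_gdal_config_options({"OGR_GPKG_NUM_THREADS": 4})

df = pyogrio.read_dataframe("example.gpkg", use_arrow=True)
```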