Casper van der Wel
I have been reading up on this issue. Columnar access (reading) would be a major performance increase for applications of pygeos and specifically some applications I will be working on...
> Only boxes are in principle a bit simpler, but on the other hand, storing it as a geoseries might make working with it easier. And is also more general...
> 1) we don't have a regular grid with rectangles (in dask, the chunksize can vary in each dimension, but they are still all rectangles in a grid, I think)...
Interesting discussion indeed! > In your two options, you seem to start from the spatial extent, and then determine which geometries belong to that spatial partition (either all geometries intersecting...
Two options I encountered: - An extension of Hadoop-GIS (SATO) samples 1-3% of the geometries to establish a partitioning scheme: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.687.6157&rep=rep1&type=pdf . This approach could be implemented in dask (parallelized)...
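To make the sampling idea concrete, here is a minimal sketch of deriving partition boundaries from a small sample of bounding boxes. It is not the SATO algorithm from the linked paper: it simply cuts the sampled box centres into equal-count vertical strips. The function names (`sample_partition_edges`, `assign_partition`), the 2% default fraction, and the strip layout are all assumptions made for illustration.

```python
import bisect
import random


def sample_partition_edges(boxes, n_partitions, fraction=0.02, seed=0):
    """Return the interior x-edges of `n_partitions` vertical strips.

    `boxes` is a list of (x1, y1, x2, y2) tuples. Only a random sample
    (about `fraction` of the boxes) is inspected; the edges are the
    quantiles of the sampled box centres. Hypothetical helper.
    """
    rng = random.Random(seed)
    # Sample at least one box per partition, but never more than we have.
    k = min(len(boxes), max(n_partitions, int(len(boxes) * fraction)))
    sample = rng.sample(boxes, k)
    centres = sorted((x1 + x2) / 2.0 for x1, _, x2, _ in sample)
    return [
        centres[(i * len(centres)) // n_partitions]
        for i in range(1, n_partitions)
    ]


def assign_partition(box, edges):
    """Return the strip index of a box, based on the x of its centre."""
    centre_x = (box[0] + box[2]) / 2.0
    return bisect.bisect_right(edges, centre_x)
```

Only the sampling step touches a subset of the data; the assignment step still visits every geometry, but it is independent per row, so it is the part that could run in parallel per chunk.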
> I also want to emphasize again that it is really not straightforward to have "duplicated" rows (to support non-overlapping spatial partitions). That will certainly be an interesting model for...
I have had good experience in letting PostGIS handle the partitioning. In general, the partitioning should be done right at the IO level, either by having the data preprocessed in...
Thanks for the fast response and benchmark! I must say I am no frequent user of distributed workloads of geometries. Maybe @jorisvandenbossche can judge if the 70% improvement is worth...
In that case the serialised form of a tree could be a vector of envelope corner coordinates (`(x1, y1, x2, y2) * n`). That would be a factor 2-3...
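As a rough illustration of that serialised form, the sketch below packs `n` envelopes into one flat buffer of `4 * n` doubles and unpacks them again. It uses only the standard library rather than the pygeos API, and the names `pack_envelopes` and `unpack_envelopes` are made up for this example; rebuilding the tree from the unpacked envelopes on the receiving side is not shown.

```python
from array import array


def pack_envelopes(envelopes):
    """Flatten (x1, y1, x2, y2) tuples into bytes: 4 doubles per envelope."""
    flat = array("d")
    for x1, y1, x2, y2 in envelopes:
        flat.extend((x1, y1, x2, y2))
    return flat.tobytes()


def unpack_envelopes(payload):
    """Inverse of pack_envelopes: bytes back to a list of 4-tuples."""
    flat = array("d")
    flat.frombytes(payload)
    if len(flat) % 4:
        raise ValueError("payload does not hold a whole number of envelopes")
    return [tuple(flat[i:i + 4]) for i in range(0, len(flat), 4)]
```

The payload size is fixed per geometry (four doubles), regardless of how many vertices the original geometry has.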