geopandas
QST: clip runs so slow
Is there an alternative method that would speed up the geopandas.clip function? I am trying to clip linestrings with polygons. Both inputs are quite large, e.g. 86 million polygons and 1 million linestrings. I have created a spatial index, but it is still very slow even on small subsets. Any suggestions?
I would recommend looking at dask-geopandas and the distributed solution it offers - https://dask-geopandas.readthedocs.io/en/stable/docs/reference/api/dask_geopandas.clip.html. 86 million polygons is ... a lot. A lot lot.
I assume part of the bottleneck here is that you are clipping with a mask consisting of many geometries. Our current implementation converts that mask to a single geometry object using unary_union, which is known to be an inherently slow operation.
That part will also not be helped by using dask-geopandas (in fact it is not even implemented there yet: with dask-geopandas you need to create that union geometry yourself).
It might be worth exploring an alternative clip logic that does not create this unary_union, but instead uses the spatial index to check, for each feature to be clipped, which geometries of the mask it intersects, and then calculates the intersection only against those.
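A minimal sketch of that idea, not the geopandas implementation: `clip_without_union` is a hypothetical helper name, and the sketch assumes a geopandas version where `sjoin` accepts `predicate` and `GeoSeries.intersection` accepts `align`. The spatial join finds the (line, polygon) pairs that intersect, and the intersection is then computed pair by pair, so no union of the mask is ever built.

```python
import geopandas as gpd
from shapely.geometry import LineString, box


def clip_without_union(lines, polygons):
    """Clip lines to many polygons without unioning the polygons first.

    Hypothetical helper for illustration; returns one row per
    (line, polygon) pair that intersects.
    """
    # Keep only the geometry column and use a positional index for the mask.
    polys = polygons[[polygons.geometry.name]].reset_index(drop=True)
    # The spatial join uses the spatial index to find intersecting pairs.
    pairs = gpd.sjoin(lines, polys, how="inner", predicate="intersects")
    # "index_right" holds the position of the matching polygon for each pair.
    matched = polys.geometry.iloc[pairs["index_right"].to_numpy()]
    # Row-by-row intersection; align=False compares by position, not by index.
    pieces = pairs.geometry.intersection(matched, align=False)
    out = pairs.drop(columns="index_right").set_geometry(pieces.values)
    return out[~out.geometry.is_empty.to_numpy()]


# Tiny made-up example: one line crossing two boxes, one line far away.
lines = gpd.GeoDataFrame(
    {"id": [1, 2]},
    geometry=[LineString([(0, 0), (10, 0)]), LineString([(20, 20), (30, 30)])],
)
polygons = gpd.GeoDataFrame(geometry=[box(2, -1, 4, 1), box(6, -1, 8, 1)])
clipped = clip_without_union(lines, polygons)
```

Unlike `geopandas.clip`, this returns one piece per intersecting pair, so a line crossing several polygons appears several times, and a line that only touches a polygon boundary yields a point rather than a line.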
Sidenote: it is best to make sure you have pygeos installed before considering other ways to improve performance (https://geopandas.org/en/latest/getting_started/install.html#using-the-optional-pygeos-dependency)
xref https://github.com/geopandas/geopandas/issues/1803 (though the original function offered there still uses unary_union, and the overlay suggestion provided by @jorisvandenbossche was the fastest solution to that problem)
@knaaptime good catch, thanks for the link, I forgot about that one. It is indeed about the same bottleneck, and @KarenChen9999 it might be interesting to test the different alternative solutions mentioned there
Thanks so much for all the good suggestions. I have created the sindex and also installed pygeos. I also tried using sjoin to work out the corresponding intersecting ids and then applying clip only to those matched subsets, but the clip function still runs slowly.
I initially tried the overlay function, but I had the linestring and polygon in the wrong order, e.g. geopandas.overlay(polygon, linestring), which returned an empty output, so I mistakenly concluded that this function only works for geometries of the same type. After switching the order, e.g. geopandas.overlay(linestring, polygon), and then applying dissolve, it worked perfectly. It runs a lot faster than clip. Note there is no difference in running time with or without creating the sindex. It takes only 1 minute on my test dataset of 2.3 million linestrings and 100,000 polygons, whereas before clip took the same time for just one polygon.
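The overlay-then-dissolve approach described above can be sketched as follows. The data and the column names `line_id` and `poly_id` are made up for illustration. The empty result with the polygons first is consistent with overlay's `keep_geom_type` behaviour, which by default keeps only geometries of the same type as the first argument.

```python
import geopandas as gpd
from shapely.geometry import LineString, box

# Made-up data: one line crossing two boxes, one line far away from both.
lines = gpd.GeoDataFrame(
    {"line_id": [1, 2]},
    geometry=[LineString([(0, 0), (10, 0)]), LineString([(20, 20), (30, 30)])],
)
polygons = gpd.GeoDataFrame(
    {"poly_id": [10, 11]},
    geometry=[box(2, -1, 4, 1), box(6, -1, 8, 1)],
)

# Lines go first so that the line pieces are kept in the result.
# how="intersection" is the default and is spelled out here for clarity.
pieces = gpd.overlay(lines, polygons, how="intersection", keep_geom_type=True)

# Merge the pieces that belong to the same original line.
clipped = pieces.dissolve(by="line_id")
```

Here `pieces` has one row per (line, polygon) intersection, and `clipped` has one row per original line that intersects at least one polygon.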
I have an issue that seems to be related to this. I am clipping ~2 million grid cells (represented as polygons) with one single polygon mask. I've run this many times before with the geopandas clip operation and it always executed on the order of seconds. But recently, after installing some different packages in my conda environment, the clip operation (and some other geopandas operations as well, such as drop_duplicates) is EXTREMELY SLOW. I'm talking >10 mins slow for the same operation that took less than 5 seconds before, on the same machine. Any ideas what is causing this extreme change? I'm honestly very perplexed and I don't know where to start to troubleshoot this.
@jorisvandenbossche @martinfleis If you have any suggestions or comments, it would be greatly appreciated!
@josephko91 are you able to figure out what has changed in your environment?
@martinfleis Yes, I used conda list --revisions to check the changes to my environment, then I used conda install --revision N to revert back to when I recall it was still working quickly. This did not solve the problem. I also tried a fresh install of GeoPandas in a new Conda environment. That didn't work either.
So even if you run the same code in the same environment, it now runs much slower than before? No idea how to debug that...
@martinfleis Yes, it is quite perplexing. From a big picture perspective, I'm confused why this would take over 10 minutes, even with a fresh install of GeoPandas in a new environment. Yes, 2 million grid cells may be "a lot" but should it really be taking that long to clip those grid cells with a single polygon? That is what is confusing to me. For example, if I did this in QGIS or ArcGIS I know it would happen very quickly.
@josephko91 to confirm: do you have PyGEOS or Shapely 2.0b1 or 2.0b2 installed in your updated environment?
@josephko91 well, to be fair, our implementation is not ideal, as you can understand from the discussion here and in #1803 linked above. We currently compute the intersection even for polygons that are fully within the mask, and our use of the spatial index is not very efficient.
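To illustrate the point about polygons that are fully within the mask, here is a rough sketch, not the geopandas implementation, that computes the intersection only for cells on the mask boundary. `clip_grid_to_polygon` is a hypothetical helper name, and the grid and mask are made up.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import box


def clip_grid_to_polygon(grid, mask_polygon):
    """Clip polygon cells to a single polygon mask.

    Hypothetical helper: cells fully inside the mask are passed through
    unchanged, and only the remaining cells are intersected with it.
    """
    # Positions of cells that intersect the mask, found via the spatial index.
    hits = np.sort(grid.sindex.query(mask_polygon, predicate="intersects"))
    candidates = grid.iloc[hits]
    inside = candidates.geometry.within(mask_polygon).to_numpy()
    interior = candidates[inside]
    boundary = candidates[~inside]
    # Only the boundary cells need the (more expensive) intersection.
    cut = boundary.geometry.intersection(mask_polygon)
    boundary = boundary.set_geometry(cut.values)
    # Simplification: cells that merely touch the mask give points or lines,
    # which are dropped here along with any geometry collections.
    keep = boundary.geometry.geom_type.isin(["Polygon", "MultiPolygon"])
    return pd.concat([interior, boundary[keep.to_numpy()]]).sort_index()


# Made-up 5 x 5 grid of unit cells and a square mask.
cells = [box(x, y, x + 1, y + 1) for x in range(5) for y in range(5)]
grid = gpd.GeoDataFrame({"cell": range(len(cells))}, geometry=cells)
mask = box(0.5, 0.5, 3.5, 3.5)
result = clip_grid_to_polygon(grid, mask)
```

Whether this is actually faster than `geopandas.clip` on a given dataset would need to be measured; the gain depends on how many cells lie fully inside the mask.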
@brendan-ward points in the right direction. What may be happening is that GeoPandas was previously using the PyGEOS engine and is now using shapely, maybe due to some installation problem on the PyGEOS side. However, that should not happen in a fresh environment.
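One way to check which engine is likely in use is to collect the installed versions. This is a small sketch with a hypothetical helper name, `geometry_engine_report`. It assumes that `geopandas.options.use_pygeos` is only available in geopandas releases that support PyGEOS, so the lookup falls back to `None` when the option is missing.

```python
import importlib

import geopandas as gpd


def geometry_engine_report():
    """Collect version info that hints at which geometry engine is in use."""
    report = {"geopandas": gpd.__version__}
    for name in ("shapely", "pygeos"):
        try:
            report[name] = importlib.import_module(name).__version__
        except Exception:
            # Not installed, or installed but not importable.
            report[name] = None
    # Assumption: this option only exists in geopandas releases that
    # support PyGEOS, so it is looked up defensively.
    report["use_pygeos"] = getattr(gpd.options, "use_pygeos", None)
    return report


report = geometry_engine_report()
print(report)
```

`geopandas.show_versions()` prints a fuller report, including the underlying GEOS version, which is convenient to paste into an issue.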
@brendan-ward In my new environment, I have pygeos 0.13 and shapely 1.8.5 installed. Full list of packages in my conda environment attached. geo_test_packages.txt
@martinfleis Understood. I'm just trying to figure out why it seemed to be working much faster previously. Will do some more tests and let you know if I figure out the root issue...
Would it help if I created and shared a minimal reproducible example with the grid and mask vector files?
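If sharing the actual files is not practical, a synthetic reproducer along these lines might be enough to compare timings between environments. This is a sketch only: `make_grid` and `time_clip` are hypothetical helper names, and the grid and circular mask are made up and not taken from the data discussed here.

```python
import time

import geopandas as gpd
import numpy as np
from shapely.geometry import Point, box


def make_grid(n):
    """Build an n x n grid of unit square cells."""
    xs, ys = np.meshgrid(np.arange(n), np.arange(n))
    cells = [
        box(float(x), float(y), float(x) + 1.0, float(y) + 1.0)
        for x, y in zip(xs.ravel(), ys.ravel())
    ]
    return gpd.GeoDataFrame({"cell": range(len(cells))}, geometry=cells)


def time_clip(n):
    """Time geopandas.clip on an n x n grid with one circular mask."""
    grid = make_grid(n)
    mask = Point(n / 2, n / 2).buffer(n / 3)
    start = time.perf_counter()
    clipped = gpd.clip(grid, mask)
    return len(clipped), time.perf_counter() - start


n_cells, seconds = time_clip(50)
print(f"{n_cells} cells kept, clip took {seconds:.3f} s")
```

Increasing `n` towards roughly 1400 gives about 2 million cells, which is comparable in size to the grid described above.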