
Question: How to improve KNN performance on big datasets?

Open CommanderWahid opened this issue 10 months ago • 1 comment

Hello,

I am working on building a geospatial platform, and one of our use cases is to find the k nearest neighbours (k=1) between two datasets.

I came across Mosaic's SpatialKNN implementation and tested it.

On the landmark side, I have 90 million rows (100% multipolygons), and on the candidate side, there are 5 million rows (80% linestrings and 20% multipolygons). The index resolution is set to 10, and I'm running the code on Databricks DBR 13.3 LTS. Cluster spec: 64 cores / 256 GB memory (Delta cache acceleration enabled).

I managed to get accurate results when I limited the landmark dataset to 100 rows and included all data from the candidate dataset.

However, when I used the full landmark dataset, the job couldn’t finish, and I had to cancel it.

I suspect that the slowness is due to handling the multipolygons:

  • The grid_tessellateexplode of multipolygons takes a significant amount of time. On the candidate dataset with just the linestrings it took 4 minutes, while adding the multipolygon rows to the linestrings pushed it to 30 minutes (a rough sketch of the call being timed follows this list).

  • The job hangs during the first iteration, on the grid_geometrykringexplode step of the landmark dataset.
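
For reference, the tessellation step being timed looks roughly like the snippet below; a minimal sketch, assuming the geometry column is called geom_wkt (the column and DataFrame names are placeholders for my actual schema):

```python
import mosaic as mos
from pyspark.sql import functions as F

mos.enable_mosaic(spark, dbutils)  # spark and dbutils are provided by the Databricks notebook

# grid_tessellateexplode is a generator: each input geometry is exploded into
# one row per covering grid cell at the chosen resolution (10 here).
candidates_tessellated = candidates_df.withColumn(
    "index", mos.grid_tessellateexplode(F.col("geom_wkt"), F.lit(10))
)
```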

In the Spark UI, I don't see any obvious issues that could guide me toward a solution.

Reducing the number of iterations doesn't help, because the job can't even finish the first iteration. Also, reducing the index resolution to 8 decreased the candidate dataset's preparation time by 5 minutes, but had no impact on the landmark side.

Do you have any recommendations regarding cluster sizing or other optimizations that could improve performance?

CommanderWahid · Feb 24 '25

Hi Wahid - thanks for the detailed explanation of the issue.

"The grid_tessellateexplode of multipolygons takes a significant amount of time"

Could you unpack those multipolys using st_dump() / flatten_polygons() and repeat to see if that helps in the indexing step?
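
A minimal sketch of what I mean, assuming the geometry column is called geom (adjust the column and DataFrame names to your schema):

```python
import mosaic as mos
from pyspark.sql import functions as F

mos.enable_mosaic(spark, dbutils)

# flatten_polygons explodes each MultiPolygon into one row per constituent
# Polygon, so the tessellation step works on smaller, simpler shapes.
landmarks_flat = landmarks_df.withColumn("geom", mos.flatten_polygons(F.col("geom")))
```

Since each multipolygon becomes several rows, keep a stable row id alongside the geometry so you can relate the KNN output back to the original landmark.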

My second question would be: how large and complex are the geometries in the 'landmark' set? Perhaps we can reduce the complexity using st_simplify() then try again? You'll need a bit of trial and error to find a tolerance that doesn't ruin the accuracy of the polygon boundaries.
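
For example, something along these lines; the tolerance below is purely illustrative and is expressed in the units of your CRS (e.g. degrees for EPSG:4326), so start small and check the results:

```python
import mosaic as mos
from pyspark.sql import functions as F

mos.enable_mosaic(spark, dbutils)

# Reduce vertex counts before indexing; increase the tolerance until the
# simplified boundaries start to hurt your nearest-neighbour accuracy.
tolerance = 0.0001  # illustrative starting value only
landmarks_simplified = landmarks_df.withColumn(
    "geom", mos.st_simplify(F.col("geom"), F.lit(tolerance))
)
```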

Finally, how is the landmarks dataframe partitioned? My intuition is that the size of the partitions in this dataframe will have some impact on the speed of the indexing step. Perhaps @milos-colic might be able to clarify that one?
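
In the meantime, one cheap thing to try is explicitly repartitioning the landmarks before the indexing step; a rough sketch (the partition count is a guess, something like 2-4x your total core count is a common starting point):

```python
# Spread the 90M landmark rows over more, smaller partitions so the expensive
# tessellation/indexing work is distributed evenly across the 64 cores.
landmarks_repartitioned = landmarks_df.repartition(256)
```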

sllynn · Feb 26 '25