spark-knn

Nearest Neighbor between two dataframes

Open rbhatia46 opened this issue 3 years ago • 3 comments

Hi, thanks for the amazing work! I have two dataframes: A has about 200 million points and B has about 10 million points. I want to find the nearest neighbor in B for every point in A, preferably in Python. How can I achieve this using this library?

rbhatia46 avatar Sep 09 '20 06:09 rbhatia46

Hi rbhatia46, have you found a solution? I'm facing the same problem, so it would be a great help! Thanks.

hexiaoyupku avatar Nov 13 '20 08:11 hexiaoyupku

Hi @hexiaoyupku, after looking at a lot of solutions, nothing worked. Try looking at Pandas UDFs and writing a custom UDF for your nearest-neighbour use case. Pandas UDFs are much more performant than the usual row-at-a-time Spark Python UDFs because they are vectorised and use Apache Arrow for optimised data transfer between Python and the JVM, so they should be decent in terms of performance. If you still want further optimisation, you can write your UDF in Scala (if you are familiar with it, or if your tech stack allows it); otherwise, Pandas UDFs in PySpark should be fine.

rbhatia46 avatar Nov 19 '20 02:11 rbhatia46

Hi @rbhatia46, could you share the code that uses a Pandas UDF? Thanks.

wahyudierwin avatar Dec 18 '20 03:12 wahyudierwin
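
No code was posted in the thread. As a rough illustration of the Pandas-based approach suggested above, here is a minimal sketch, not rbhatia46's actual code and not part of spark-knn. It uses `DataFrame.mapInPandas` (Spark 3.0+), one of the pandas function APIs that sit alongside Pandas UDFs and use the same Arrow-based transfer. Everything named here is an assumption made for the example: both dataframes are assumed to have a long-typed `id` column and numeric feature columns, B is assumed small enough to collect and broadcast, and the helper names `nearest_in_b`, `make_batch_fn`, and `nearest_neighbours` are hypothetical.

```python
from typing import Iterator

import numpy as np
import pandas as pd


def nearest_in_b(a_points, b_points, chunk=1024):
    """Brute-force search: for each row of a_points, return the index of the
    closest row of b_points and the Euclidean distance to it."""
    idx = np.empty(len(a_points), dtype=np.int64)
    dist = np.empty(len(a_points), dtype=np.float64)
    for start in range(0, len(a_points), chunk):
        block = a_points[start:start + chunk]
        # Pairwise squared distances, shape (len(block), len(b_points)).
        d2 = ((block[:, None, :] - b_points[None, :, :]) ** 2).sum(axis=2)
        best = d2.argmin(axis=1)
        idx[start:start + chunk] = best
        dist[start:start + chunk] = np.sqrt(d2[np.arange(len(block)), best])
    return idx, dist


def make_batch_fn(bc_ids, bc_points, feature_cols):
    """Build the function handed to mapInPandas. bc_ids and bc_points are
    Spark broadcast variables (anything with a .value attribute works)."""
    def match_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        for pdf in batches:
            a = pdf[feature_cols].to_numpy(dtype=np.float64)
            idx, dist = nearest_in_b(a, bc_points.value)
            yield pd.DataFrame({
                "a_id": pdf["id"].to_numpy(),
                "b_id": bc_ids.value[idx],
                "distance": dist,
            })
    return match_batches


def nearest_neighbours(spark, df_a, df_b, feature_cols):
    """Return a Spark DataFrame (a_id, b_id, distance) with one row per row of A.
    Assumes B fits in driver and executor memory; feature_cols is a list."""
    b_pdf = df_b.select("id", *feature_cols).toPandas()
    sc = spark.sparkContext
    bc_ids = sc.broadcast(b_pdf["id"].to_numpy())
    bc_points = sc.broadcast(b_pdf[feature_cols].to_numpy(dtype=np.float64))
    return df_a.select("id", *feature_cols).mapInPandas(
        make_batch_fn(bc_ids, bc_points, feature_cols),
        schema="a_id long, b_id long, distance double",
    )
```

With a running session this would be called as `nearest_neighbours(spark, df_a, df_b, ["x", "y"])`, where `x` and `y` stand in for whatever the feature columns are. The brute-force search compares every point in a batch with every point in B, so at the sizes mentioned in the question (200 million against 10 million) it would likely be far too slow; replacing `nearest_in_b` with a spatial index built once per executor, for example `scipy.spatial.cKDTree`, would be the natural next step for low-dimensional data.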