spark-knn
Nearest Neighbor between two dataframes
Hi, thanks for the amazing work! I have two dataframes: A has about 200 million points and B has about 10 million points. For every point in A, I want to find its nearest neighbor in B, preferably in Python. How can I achieve this using this library?
Hi rbhatia46, have you found a solution? I'm facing the same problem, so it would be a great help! Thanks.
Hi @hexiaoyupku, after looking at a lot of solutions, nothing worked for me. Try looking at Pandas UDFs and writing a custom UDF for your nearest-neighbour use case. Pandas UDFs are much more performant than regular row-at-a-time Python UDFs in Spark because they are vectorised and use Apache Arrow for optimised data transfer between Python and the JVM. They should be decent in terms of performance. If you still want further optimisation, you can write your UDF in Scala (if you are familiar with it, or if your tech stack allows it); otherwise Pandas UDFs in PySpark should be fine.
Hi @rbhatia46, could you share the code that uses a Pandas UDF? Thanks.
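No code was posted in this thread, so here is a minimal sketch of what the Pandas UDF suggestion above could look like. It is not taken from spark-knn and makes several assumptions: both dataframes have columns named `id` (long), `x` and `y` (doubles); distance is plain Euclidean; and B (the ~10 million point side) fits in memory, so it can be collected to the driver and broadcast as a SciPy `cKDTree`. The function names `nearest_ids` and `add_nearest_neighbor` are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def nearest_ids(x: pd.Series, y: pd.Series, tree: cKDTree, b_ids: np.ndarray) -> pd.Series:
    """For each (x, y) pair, return the id of the closest point stored in `tree`."""
    a_xy = np.column_stack([x.to_numpy(dtype=float), y.to_numpy(dtype=float)])
    _, idx = tree.query(a_xy, k=1)  # idx[i] is the row of B nearest to a_xy[i]
    return pd.Series(b_ids[idx], index=x.index)


def add_nearest_neighbor(spark, df_a, df_b):
    """Add a column `nn_id` to df_a holding the id of the nearest point in df_b.

    Hypothetical helper: column names and the "B fits in memory" premise are
    assumptions, not something the thread confirms.
    """
    # Imported here so nearest_ids() can be tried out without a Spark install.
    from pyspark.sql.functions import pandas_udf

    # Collect B once on the driver and build the KD-tree there.
    b_pdf = df_b.select("id", "x", "y").toPandas()
    tree = cKDTree(b_pdf[["x", "y"]].to_numpy(dtype=float))

    # Broadcast the tree and the ids so executors receive them once
    # rather than with every task.
    bc = spark.sparkContext.broadcast((tree, b_pdf["id"].to_numpy()))

    @pandas_udf("long")
    def nn_id(x: pd.Series, y: pd.Series) -> pd.Series:
        b_tree, b_ids = bc.value
        return nearest_ids(x, y, b_tree, b_ids)

    return df_a.withColumn("nn_id", nn_id("x", "y"))
```

With a running session this would be called as `result = add_nearest_neighbor(spark, df_a, df_b)`. If the points are latitude/longitude pairs, Euclidean distance on raw degrees is only a rough approximation, so the coordinates would need projecting first or the tree swapping for a structure that supports haversine distance. If B does not fit in executor memory, this approach does not apply as written.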