sparse_dot_topn

How to integrate this package into pyspark?

Open sheboke93 opened this issue 5 years ago • 2 comments

Does this package work with pyspark? Currently I use pyspark sparse vectors to build the csr_matrix and use this package to get the cosine similarity result, but after that I don't know how to map the result back to the original data. I tried using numpy to remap it, but I'm not sure that is the best way. Could you recommend a way to deal with this, or provide a code sample for using this package in pyspark?
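For readers with the same question, here is a minimal sketch of one way to do the core step on a single batch of data. It assumes the features are `pyspark.ml.linalg.SparseVector` objects that are already L2-normalised (so the dot product equals cosine similarity), and it uses the `awesome_cossim_topn` function exported by the 0.x releases of this package; the variable names (`query_vectors`, `query_names`, `ground_truth_vectors`, `ground_truth_names`) are placeholders, not anything defined by the package.

```python
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn  # function name in the 0.x releases

def to_csr(sparse_vectors):
    """Stack a list of pyspark.ml.linalg.SparseVector into one SciPy CSR matrix."""
    indptr, indices, data = [0], [], []
    dim = sparse_vectors[0].size
    for v in sparse_vectors:
        indices.extend(v.indices)
        data.extend(v.values)
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(len(sparse_vectors), dim))

# Plain Python lists of SparseVectors plus the matching identifiers (placeholders).
A = to_csr(query_vectors)
B = to_csr(ground_truth_vectors)

# Top-10 cosine matches per query row, keeping only scores above 0.7.
C = awesome_cossim_topn(A, B.T.tocsr(), ntop=10, lower_bound=0.7)

# The result is itself sparse: its row/col coordinates map straight back
# to positions in query_names and ground_truth_names.
C = C.tocoo()
matches = [(query_names[i], ground_truth_names[j], float(s))
           for i, j, s in zip(C.row, C.col, C.data)]
```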

sheboke93 avatar Jun 23 '20 18:06 sheboke93

@sheboke93 this package works with pyspark.

You can check this video, where the Spark integration is shown at around the 10:00 mark: https://databricks.com/session/large-scale-fuzzy-name-matching-with-a-custom-ml-pipeline-in-batch-and-streaming

ymwdalex avatar Jun 23 '20 20:06 ymwdalex

Hi Zhe, thanks for sharing. It was a fascinating presentation. However, I still have some questions about the implementation details. Would you mind answering these two questions?

  1. When you implement the batch process, how do you generate the csr_matrix for the ground truth and the raw names? What I currently do is extract the information from the Spark sparse vectors and use SciPy's csr_matrix constructor to build it. Is that the right way?
  2. When you get the cosine similarity matrix, how do you map it back to the original data in each partition? (One possible approach is sketched below.) If you can answer my questions in your spare time, I would really appreciate it!
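On the second question, one hedged way to distribute the matching is to broadcast the ground-truth matrix and run everything inside `mapPartitions`, so each partition builds its own query matrix and maps the sparse result back to names via its COO coordinates. The sketch below reuses the `to_csr` helper and placeholder names from the earlier snippet; `spark`, `df`, and the column names `name` and `features` are assumptions, not anything prescribed by the package or the talk.

```python
from sparse_dot_topn import awesome_cossim_topn

# Broadcast the (transposed) ground-truth matrix and its identifiers once per job.
gt_T = to_csr(ground_truth_vectors).T.tocsr()
bc_gt = spark.sparkContext.broadcast((gt_T, ground_truth_names))

def match_partition(rows):
    rows = list(rows)
    if not rows:
        return
    names = [r["name"] for r in rows]
    A = to_csr([r["features"] for r in rows])      # query matrix for this partition only
    B_T, gt_names = bc_gt.value
    C = awesome_cossim_topn(A, B_T, ntop=10, lower_bound=0.7).tocoo()
    for i, j, score in zip(C.row, C.col, C.data):
        yield names[i], gt_names[j], float(score)  # map back via the COO coordinates

matched = (df.select("name", "features")
             .rdd
             .mapPartitions(match_partition)
             .toDF(["name", "ground_truth_name", "score"]))
```

Broadcasting the ground-truth side keeps one copy per executor, and working per partition keeps each query matrix small enough to fit in memory; this is only one possible layout, not necessarily the one used in the presentation.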

sheboke93 avatar Jun 23 '20 20:06 sheboke93