sparse_dot_topn

How to integrate this package into pyspark?

Open sheboke93 opened this issue 5 years ago • 2 comments

Does this package work with pyspark? Currently I use pyspark sparse vectors to build the csr_matrix and use this package to get the cosine similarity result, but after that I don't know how to map the result back to the original data. I tried using numpy to remap it, but I'm not sure that is the best way. Could you recommend a way to deal with this, or provide a code sample for using this package in pyspark?
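For readers with the same question, here is a minimal sketch of one way to do the core step on a single batch of data. It assumes the features are `pyspark.ml.linalg.SparseVector` objects that are already L2-normalised (so the dot product equals cosine similarity), and it uses the `awesome_cossim_topn` function exported by the 0.x releases of this package; the variable names (`query_vectors`, `query_names`, `ground_truth_vectors`, `ground_truth_names`) are placeholders, not anything defined by the package.

```python
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn  # function name in the 0.x releases

def to_csr(sparse_vectors):
    """Stack a list of pyspark.ml.linalg.SparseVector into one SciPy CSR matrix."""
    indptr, indices, data = [0], [], []
    dim = sparse_vectors[0].size
    for v in sparse_vectors:
        indices.extend(v.indices)
        data.extend(v.values)
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(len(sparse_vectors), dim))

# Plain Python lists of SparseVectors plus the matching identifiers (placeholders).
A = to_csr(query_vectors)
B = to_csr(ground_truth_vectors)

# Top-10 cosine matches per query row, keeping only scores above 0.7.
C = awesome_cossim_topn(A, B.T.tocsr(), ntop=10, lower_bound=0.7)

# The result is itself sparse: its row/col coordinates map straight back
# to positions in query_names and ground_truth_names.
C = C.tocoo()
matches = [(query_names[i], ground_truth_names[j], float(s))
           for i, j, s in zip(C.row, C.col, C.data)]
```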

sheboke93 avatar Jun 23 '20 18:06 sheboke93

@sheboke93 this package works with pyspark.

You can check this video, where the Spark integration is shown at around the 10:00 mark: https://databricks.com/session/large-scale-fuzzy-name-matching-with-a-custom-ml-pipeline-in-batch-and-streaming

ymwdalex avatar Jun 23 '20 20:06 ymwdalex

Hi Zhe, thanks for sharing. It was a fascinating presentation. However, I still have some questions about the implementation details. Would you mind answering these two questions?

  1. When you implement the batch process, how do you generate the csr_matrix for the ground truth and the raw names? What I currently do is extract the information from the Spark sparse vectors and use SciPy's csr_matrix constructor to build it. Is that the right way?
  2. When you get the cosine similarity matrix, how do you map it back to the original data in each partition? (One possible approach is sketched below.) If you can answer my questions in your spare time, I would really appreciate it!
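On the second question, one hedged way to distribute the matching is to broadcast the ground-truth matrix and run everything inside `mapPartitions`, so each partition builds its own query matrix and maps the sparse result back to names via its COO coordinates. The sketch below reuses the `to_csr` helper and placeholder names from the earlier snippet; `spark`, `df`, and the column names `name` and `features` are assumptions, not anything prescribed by the package or the talk.

```python
from sparse_dot_topn import awesome_cossim_topn

# Broadcast the (transposed) ground-truth matrix and its identifiers once per job.
gt_T = to_csr(ground_truth_vectors).T.tocsr()
bc_gt = spark.sparkContext.broadcast((gt_T, ground_truth_names))

def match_partition(rows):
    rows = list(rows)
    if not rows:
        return
    names = [r["name"] for r in rows]
    A = to_csr([r["features"] for r in rows])      # query matrix for this partition only
    B_T, gt_names = bc_gt.value
    C = awesome_cossim_topn(A, B_T, ntop=10, lower_bound=0.7).tocoo()
    for i, j, score in zip(C.row, C.col, C.data):
        yield names[i], gt_names[j], float(score)  # map back via the COO coordinates

matched = (df.select("name", "features")
             .rdd
             .mapPartitions(match_partition)
             .toDF(["name", "ground_truth_name", "score"]))
```

Broadcasting the ground-truth side keeps one copy per executor, and working per partition keeps each query matrix small enough to fit in memory; this is only one possible layout, not necessarily the one used in the presentation.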

sheboke93 avatar Jun 23 '20 20:06 sheboke93