datasketch icon indicating copy to clipboard operation
datasketch copied to clipboard

Does there have any example to remove duplicate docs using MinHash?

Open zyh3826 opened this issue 2 years ago • 4 comments

Does there have any example to remove duplicate docs using MinHash?

zyh3826 avatar Jun 17 '22 08:06 zyh3826

Not yet. But maybe you can create one and add it as a pull request :)

I would start from creating MinHash of normalized (tokenized, lowercased, truncated, etc.) documents. Once you have N MinHash for N documents, you have two choices:

  1. Use brute force to compute the Jaccard similarity of all pairs of MinHash to find documents that have very high similarity (e.g., >0.95 Jaccard)
  2. Use MinHashLSH index. Insert all the MinHash into the index, and then query each MinHash to find highly similar candidates (except itself), compute their Jaccard similarities, and (exactly, or use MinHash), and find ones with very high similarity.

The second option is faster, the first option is more accurate.

ekzhu avatar Jun 20 '22 22:06 ekzhu

Hi @ekzhu, I would like to work on this. I have already built something similar for my use-case where I have to deduplicate a huge corpus of almost 100M documents. I am using the first approach, I had tried the second one but I was using multiprocessing to achieve parallelism. In MinHashLSH I was not able to merge all the object created into different processes into one. So, I would like to know which approach should we move ahead with for this one.

rupeshkumaar avatar Oct 27 '23 19:10 rupeshkumaar

Hi @ekzhu, I would like to work on this. I have already built something similar for my use-case where I have to deduplicate a huge corpus of almost 100M documents. I am using the first approach, I had tried the second one but I was using multiprocessing to achieve parallelism. In MinHashLSH I was not able to merge all the object created into different processes into one. So, I would like to know which approach should we move ahead with for this one.

Sounds good. I believe this also addresses https://github.com/ekzhu/datasketch/issues/205. You can submit a PR and we can go from there.

ekzhu avatar Dec 01 '23 06:12 ekzhu

I am planning to work on this project in my free time. So, a few questions

  1. Do we need to add it as a class method attached to the MinHash class, since it will be a util kind of method (good to have) so, could we keep it separate?
  2. The other one was, which method do you think? should we go for since #205 is implemented so we can choose either of the methods? Since it is going to be a tradeoff between speed and accuracy.

If you have any other suggestions, please let me know. @ekzhu

rupeshkumaar avatar Mar 12 '24 17:03 rupeshkumaar