datasketch
datasketch copied to clipboard
Does there have any example to remove duplicate docs using MinHash?
Does there have any example to remove duplicate docs using MinHash?
Not yet. But maybe you can create one and add it as a pull request :)
I would start from creating MinHash of normalized (tokenized, lowercased, truncated, etc.) documents. Once you have N MinHash for N documents, you have two choices:
- Use brute force to compute the Jaccard similarity of all pairs of MinHash to find documents that have very high similarity (e.g., >0.95 Jaccard)
- Use MinHashLSH index. Insert all the MinHash into the index, and then query each MinHash to find highly similar candidates (except itself), compute their Jaccard similarities, and (exactly, or use MinHash), and find ones with very high similarity.
The second option is faster, the first option is more accurate.
Hi @ekzhu, I would like to work on this. I have already built something similar for my use-case where I have to deduplicate a huge corpus of almost 100M documents. I am using the first approach, I had tried the second one but I was using multiprocessing to achieve parallelism. In MinHashLSH I was not able to merge all the object created into different processes into one. So, I would like to know which approach should we move ahead with for this one.
Hi @ekzhu, I would like to work on this. I have already built something similar for my use-case where I have to deduplicate a huge corpus of almost 100M documents. I am using the first approach, I had tried the second one but I was using multiprocessing to achieve parallelism. In MinHashLSH I was not able to merge all the object created into different processes into one. So, I would like to know which approach should we move ahead with for this one.
Sounds good. I believe this also addresses https://github.com/ekzhu/datasketch/issues/205. You can submit a PR and we can go from there.
I am planning to work on this project in my free time. So, a few questions
- Do we need to add it as a class method attached to the MinHash class, since it will be a util kind of method (good to have) so, could we keep it separate?
- The other one was, which method do you think? should we go for since #205 is implemented so we can choose either of the methods? Since it is going to be a tradeoff between speed and accuracy.
If you have any other suggestions, please let me know. @ekzhu