
ModuleNotFoundError: No module named 'text_dedup.embedders'

done520 opened this issue Sep 19 '22 · 6 comments

I get the following error when running "from text_dedup.embedders.minhash import MinHashEmbedder":

ModuleNotFoundError: No module named 'text_dedup.embedders'

done520 · Sep 19 '22 12:09

Can I ask what version of text_dedup you are using?

If it is installed from PyPI, this shouldn't be an issue. But if you are using the main branch, there are some breaking changes that the documentation hasn't caught up with; in that case, you can write:

from text_dedup.near_dedup import MinHashEmbedder

The documentation should be updated in the next couple of days.

ChenghaoMou · Sep 19 '22 15:09

Thanks. I would like to ask whether text_dedup can be used for academic paper duplicate testing. Have you ever tried that?

done520 · Sep 22 '22 03:09

By "academic paper duplicate testing", can you clarify what you mean exactly?

  1. Deduplicating data when the data are academic papers
  2. Using this for a research paper for testing other datasets

ChenghaoMou · Sep 22 '22 04:09

@ChenghaoMou By "academic paper duplicate testing" I mean the same as duplicate checking of a graduation thesis to determine whether the article is plagiarized. I would also like to know how it performs if you have used it for academic paper duplicate testing.

done520 · Sep 26 '22 08:09


Here are my two cents:

  1. Plagiarism detection is different from deduplication (what is duplicated might not be plagiarized), especially when there can be direct quotes, paraphrasing, or summarization in the review section, the list of citations, and so on.
  2. Assuming you figure out a way to strip noise from a paper: to perform plagiarism detection effectively, you would need a large index of papers. And papers are usually longer than internet articles or typical NLP training inputs, so I would expect longer processing times as a result.

With that said, if I had a cleaned, large set of papers, I would start with exact substring dedup, then near dedup, and finally semantic dedup, each requiring more compute than the last, and see where that leads.
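To make the near-dedup step of that pipeline concrete, here is a minimal, self-contained MinHash sketch over word shingles. All function names and parameters below (shingles, minhash_signature, num_perm, etc.) are illustrative assumptions for this comment, not text_dedup's actual API:

```python
# Minimal MinHash near-dedup sketch (illustrative, not text_dedup's API).
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_perm=64):
    """One minimum hash value per seeded hash function."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(2, "big")  # blake2b accepts salts up to 16 bytes
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In practice you would bucket signatures with locality-sensitive hashing instead of comparing every pair, which is what keeps near dedup cheaper than the semantic step.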

ChenghaoMou · Sep 26 '22 23:09

I'm sorry for the late reply, and many thanks for your fruitful suggestions and proposals.

paperClub-hub · Oct 08 '22 09:10