text-dedup
ModuleNotFoundError: No module named 'text_dedup.embedders'
when running "from text_dedup.embedders.minhash import MinHashEmbedder"
Can I ask what version of text_dedup you are using?
If it is installed from PyPI, this shouldn't be an issue. But if you are using the main branch, there are some breaking changes that the documentation hasn't caught up with yet; in that case, you can write:
from text_dedup.near_dedup import MinHashEmbedder
The documentation should be updated in the next couple of days.
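For readers unfamiliar with what a MinHash embedder computes, here is a minimal, self-contained sketch of the underlying idea (this is a toy illustration in pure Python, not the text_dedup API): a document is broken into word shingles, each of a fixed set of seeded hash functions keeps its minimum hash over the shingles, and the fraction of matching signature slots between two documents estimates their Jaccard similarity.

```python
import hashlib
import re

def shingles(text, k=3):
    """Split text into a set of lowercase word k-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(sh, num_perm=128):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value over all shingles; the list of minimums is the signature."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ~ Jaccard similarity of shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"   # near duplicate of a
c = "a completely different sentence about scholarly papers"
sig_a = minhash_signature(shingles(a))
sig_b = minhash_signature(shingles(b))
sig_c = minhash_signature(shingles(c))
```

Near-duplicate pairs like `a` and `b` get a high estimated similarity, while unrelated texts like `a` and `c` score near zero; the library's embedder applies the same principle at scale.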
Thanks. I would like to ask whether text_dedup can be used for academic paper duplicate testing. Have you ever tried that?
By "academic paper duplicate testing", can you clarify what you mean exactly?
- Deduplicating data when the data are academic papers
- Using this for a research paper for testing other datasets
@ChenghaoMou By 'academic paper duplicate testing' I mean duplicate checking of a graduation thesis to determine whether the article is plagiarized. I would also like to know about the performance, if you have used it for academic paper duplicate testing.
Here are my two cents:
- Plagiarism detection can be different from deduplication (what is duplicated might not be plagiarized), especially when there can be direct quotes, paraphrasing, or summarization in the review section, the list of citations, and so on.
- Assume you figure out a way to strip that noise from a paper. To perform plagiarism detection effectively, you would need to index a large number of papers. And papers are usually much longer than internet articles or typical NLP training inputs, so I would expect longer processing times as a result.
With that said, if I have a cleaned large set of papers, I would start with exact substring dedup, then near dedup, and finally semantic dedup, each having increasing level of compute need, and see where it leads me.
I'm sorry for the late reply, and many thanks for your fruitful suggestions and proposals.