text-dedup
Can we use it for Arabic text?
I am testing it on Arabic text, but it removes all of the data, marking every document as a duplicate, even with a threshold of 95%.
Arabic text might require a different tokenisation method. Feel free to change the tokenisation in the source file you are using, e.g. https://github.com/ChenghaoMou/text-dedup/blob/85dd9272e2cc0e873b1abb556807e1596f722284/text_dedup/minhash.py#L121 for minhash.py.
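To illustrate why tokenisation can matter here, below is a minimal sketch in plain Python. It assumes, without having checked the linked line, that the tokens are produced by splitting on a regex that only treats ASCII letters and digits as word characters. The names `ASCII_ONLY`, `UNICODE_AWARE`, `shingles`, and `jaccard` are made up for this example and are not part of text-dedup. Under that assumption every Arabic character is a separator, each document yields an empty token set, and all documents look identical.

```python
import re

# Hypothetical ASCII-only splitter: every Arabic character counts as a separator.
ASCII_ONLY = re.compile(r"[^A-Za-z_0-9]")
# Unicode-aware splitter: in Python 3, \W on str patterns respects all scripts.
UNICODE_AWARE = re.compile(r"\W+")


def shingles(text, splitter, n=3):
    """Return the set of word n-grams produced by the given splitter."""
    tokens = [t for t in splitter.split(text) if t]
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity; two empty sets are treated as identical here
    (an assumption, to mimic documents whose signatures carry no information)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "ذهب الولد إلى المدرسة صباح اليوم"
doc_b = "تقع القاهرة على ضفاف نهر النيل"

sim_ascii = jaccard(shingles(doc_a, ASCII_ONLY), shingles(doc_b, ASCII_ONLY))
sim_unicode = jaccard(shingles(doc_a, UNICODE_AWARE), shingles(doc_b, UNICODE_AWARE))
print("ASCII-only splitter:   ", sim_ascii)
print("Unicode-aware splitter:", sim_unicode)
```

If the ASCII-only similarity is high for two unrelated Arabic sentences while the Unicode-aware one is low, swapping the splitting pattern in your local copy of minhash.py is a reasonable first thing to try.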