Can we use it for Arabic text?

Open hahmad2008 opened this issue 1 year ago • 1 comments

I am testing it on Arabic text, but it removes all of the data and marks every document as a duplicate, even with a threshold of 95%.

hahmad2008 avatar Jun 03 '24 13:06 hahmad2008

Arabic text might require a different tokenisation method. Feel free to change the tokenisation in the source file you are using, e.g. https://github.com/ChenghaoMou/text-dedup/blob/85dd9272e2cc0e873b1abb556807e1596f722284/text_dedup/minhash.py#L121 for minhash.py.
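As a rough illustration of why tokenisation matters here, the sketch below compares a hypothetical ASCII-only splitter with a Unicode-aware one on two unrelated Arabic sentences. It is not the code at the linked line; the names `ASCII_NON_ALPHA`, `UNICODE_NON_WORD`, `word_ngrams`, and `jaccard` are made up for this example, and whether the ASCII-only case matches what your version of the script does is an assumption you would need to check.

```python
import re

# Hypothetical ASCII-only splitter (an assumption, for illustration):
# every Arabic letter counts as a separator, so no tokens survive.
ASCII_NON_ALPHA = re.compile(r"[^A-Za-z0-9_]+")

# Unicode-aware splitter: in Python 3, \W on str patterns follows
# Unicode word-character rules, so Arabic letters are kept as tokens.
UNICODE_NON_WORD = re.compile(r"\W+")


def word_ngrams(text, splitter, n=2):
    """Return the set of word n-grams produced by the given splitter."""
    tokens = [t for t in splitter.split(text.lower()) if t]
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets compare as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two unrelated Arabic sentences (no shared words).
doc_a = "مرحبا بالعالم هذا نص تجريبي"
doc_b = "الطقس جميل اليوم في المدينة"

ascii_sim = jaccard(word_ngrams(doc_a, ASCII_NON_ALPHA),
                    word_ngrams(doc_b, ASCII_NON_ALPHA))
unicode_sim = jaccard(word_ngrams(doc_a, UNICODE_NON_WORD),
                      word_ngrams(doc_b, UNICODE_NON_WORD))

print("ASCII-only splitter similarity:", ascii_sim)
print("Unicode-aware splitter similarity:", unicode_sim)
```

If a splitter leaves every document with an empty token set, all documents look identical to each other, which would be consistent with everything being flagged as a duplicate regardless of the threshold. Swapping in a Unicode-aware pattern (or a proper Arabic tokeniser) at the tokenisation step is one way to test that hypothesis.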

ChenghaoMou avatar Jun 03 '24 14:06 ChenghaoMou

Stale issue message

github-actions[bot] avatar Aug 02 '24 17:08 github-actions[bot]