
Support for Devanagari and Indian Languages

Open srnthsrdhrn opened this issue 3 years ago • 0 comments

Hi. First of all, I would like to thank you for creating such a wonderful library. Really helps me a lot.

I am trying to use it with Devanagari (the script used for Hindi), and that is where I am running into issues.

The issue is that when I try to extract a keyword from a string, longer words that merely contain that keyword as a substring are also matched.

Example:

If I am searching for "Pam" I am also getting "Pamella".

From my rough understanding of the underlying algorithm, these cases ideally shouldn't occur.

So I am assuming this is something to do with the script of the text. Do we have a solution for this?

I came across this issue with Chinese: https://github.com/vi3k6i5/flashtext/issues/43

There you mentioned that the absence of proper tokenization for the language is the issue. If that is the case here, I should be able to help in that regard.

For people who come to this issue looking for a solution, I am temporarily using a hack to get around this:

I use flashtext to extract the keywords, then use the regex library to search the text for only those extracted keywords. The regex library supports Unicode scripts, so patterns with word boundaries work for me. In effect, flashtext reduces the search space and regex only has to confirm a handful of candidates, which still gives good turnaround times.
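A rough sketch of the second (confirmation) step of this hack is below. It is not my exact code: to keep it dependency-free, the candidate list is passed in as a plain Python list (standing in for whatever flashtext returned), and the regex word-boundary pattern is swapped for a manual check based on Unicode character categories from the standard library. The helper names `is_word_char`, `whole_word_spans`, and `confirm_candidates` are made up for this illustration.

```python
import unicodedata


def is_word_char(ch):
    """True for letters, combining marks, digits, and underscore in any script."""
    # Unicode categories: L* = letters, M* = marks (e.g. Devanagari vowel
    # signs and virama), N* = numbers.
    return ch == "_" or unicodedata.category(ch)[0] in "LMN"


def whole_word_spans(text, keyword):
    """Return (start, end) spans where keyword occurs as a whole word in text."""
    spans = []
    if not keyword:
        return spans
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # A match counts only if it is not glued to a word character on either side.
        left_ok = start == 0 or not is_word_char(text[start - 1])
        right_ok = end == len(text) or not is_word_char(text[end])
        if left_ok and right_ok:
            spans.append((start, end))
        start = text.find(keyword, start + 1)
    return spans


def confirm_candidates(text, candidates):
    """Keep only the candidates that occur at least once as a whole word."""
    return [kw for kw in candidates if whole_word_spans(text, kw)]


# The candidate list stands in for the keywords returned by the first pass.
print(confirm_candidates("Pamella called", ["Pam"]))  # []
print(whole_word_spans("राम और रामायण", "राम"))       # [(0, 3)]
```

Treating combining marks as word characters matters for Devanagari: in the second example, the occurrence inside the longer word is rejected because the next character is a vowel sign, which a letters-only check would miss. This sketch does not handle zero-width joiners or other format characters.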

srnthsrdhrn · May 06 '21 09:05