Replacing text without word boundary markers

Open Ekkalak-T opened this issue 6 years ago • 0 comments

Is it possible to find and replace words in text that is written without word boundary markers?

This problem is common in many East and Southeast Asian languages such as Thai, Chinese, and Japanese, where words are typically written together without word boundary markers. For simplicity, let me give an example in English.

Example in English

test_dict = ["This","is","an","example"]
text = "Thisisanexample"
Expected output: <mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>

Currently I am using a regex, but it is very slow on the entire corpus because my dictionary has more than 600K words. I am looking for an algorithm that runs faster than a regex.

1. Regex

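import re

# Escape each word so regex metacharacters match literally, and try longer words
# first because Python's alternation takes the first alternative that matches.
pattern = '|'.join(re.escape(w) for w in sorted(test_dict, key=len, reverse=True))
namesRegex = re.compile('(' + pattern + ')', re.I)
replaced = namesRegex.sub(r'<mark>\1</mark>', text)
print(replaced)

Output:
`<mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>`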

2. Flashtext

from flashtext import KeywordProcessor
processor = KeywordProcessor()
for word in test_dict:
    processor.add_keyword(word,"<mark>"+word+"</mark>")
    #print(word,":","<mark>"+word+"</mark>")

found = processor.replace_keywords(text)
print(found)
Output:
`Thisisanexample`
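Flashtext leaves the text unchanged here, presumably because it only matches keywords that are delimited by non-word characters. To make the desired behavior concrete, here is a minimal sketch of the kind of algorithm I have in mind: a plain character trie scanned with greedy longest-match, ignoring word boundaries entirely. The names `build_trie` and `mark_words` are hypothetical helpers written for this issue and are not part of the flashtext API. Greedy longest-match is also an assumption; it may not give the best segmentation for real Thai, Chinese, or Japanese text.

```python
def build_trie(words):
    """Build a character trie; the key "" marks the end of a dictionary word."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():  # lowercase to mimic re.I
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def mark_words(text, trie, open_tag="<mark>", close_tag="</mark>"):
    """Wrap the longest dictionary word found at each position in tags.

    Characters that do not start any dictionary word are copied unchanged.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        node = trie
        end = -1  # end index of the longest match starting at i
        j = i
        while j < n:
            node = node.get(text[j].lower())
            if node is None:
                break
            j += 1
            if "" in node:
                end = j
        if end == -1:
            out.append(text[i])
            i += 1
        else:
            out.append(open_tag + text[i:end] + close_tag)
            i = end
    return "".join(out)


test_dict = ["This", "is", "an", "example"]
text = "Thisisanexample"
result = mark_words(text, build_trie(test_dict))
print(result)
```

Each position in the text is looked up by walking the trie one character at a time, so the work per position is bounded by the longest dictionary word rather than by the 600K entries in the dictionary.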

Ekkalak-T, Mar 28 '18 08:03