flashtext
Replacing text without word boundary markers
Is it possible to find and replace words in a sentence that has no word boundary markers?
This kind of problem is very common in many Asian languages such as Thai, Chinese and Japanese, where words are typically written together without word boundary markers. For simplicity, let me give an example in English.
Example in English
test_dict = ["This","is","an","example"]
text = "Thisisanexample"
Expected output: <mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>
Currently I am using a regex, but it is very slow on the entire corpus because my dictionary has more than 600K words. I am looking for an algorithm that runs faster than the regex approach.
1. Regex
import re
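# Escape each word so regex metacharacters in the dictionary are matched literally
namesRegex = re.compile(r'(' + '|'.join(re.escape(w) for w in test_dict) + ')', re.I)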
replaced = namesRegex.sub(r'<mark>\1</mark>', text)
print(replaced)
Output
`<mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>`
2. Flashtext
from flashtext import KeywordProcessor
processor = KeywordProcessor()
for word in test_dict:
    processor.add_keyword(word, "<mark>" + word + "</mark>")
    #print(word, ":", "<mark>" + word + "</mark>")
found = processor.replace_keywords(text)
print(found)
Output
`Thisisanexample`
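As far as I understand, flashtext only matches a keyword when it is delimited by word boundaries, which would explain why the text comes back unchanged. One direction I am considering instead is a plain-Python trie with greedy longest-match scanning. The following is only a minimal sketch under that assumption: `build_trie` and `mark_words` are hypothetical helper names of my own (not part of flashtext or `re`), and I have not benchmarked it against a 600K-word dictionary.

```python
END = None  # sentinel key marking the end of a dictionary word in the trie


def build_trie(words):
    """Build a nested-dict trie keyed by lowercased characters."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch.lower(), {})
        node[END] = True
    return root


def mark_words(text, trie, open_tag="<mark>", close_tag="</mark>"):
    """Wrap the longest dictionary match starting at each position in tags."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node = trie
        j = i
        longest = -1  # exclusive end index of the longest match starting at i
        # Walk the trie as far as the text allows, remembering the last full word
        while j < n and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            if END in node:
                longest = j
        if longest > i:
            out.append(open_tag + text[i:longest] + close_tag)
            i = longest
        else:
            # No dictionary word starts here; copy the character through
            out.append(text[i])
            i += 1
    return "".join(out)


test_dict = ["This", "is", "an", "example"]
text = "Thisisanexample"
print(mark_words(text, build_trie(test_dict)))
```

On the example above this prints `<mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>`. Each position is scanned at most as far as the longest dictionary word, so the cost does not grow with the number of dictionary entries the way a large regex alternation does. The caveat is that greedy longest-match can pick a wrong segmentation when words overlap, so real Thai, Chinese or Japanese text may need a proper word segmenter.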