span_info on combined unicode character(s)
Hello,
I encountered an issue with span_info=True when it is used on a string containing combined characters. As a demonstration, consider the following example:
import re
from flashtext import KeywordProcessor
from unicodedata import normalize
from unidecode import unidecode
s = KeywordProcessor()
s.set_non_word_boundaries('_')
k = 'afa'
s.add_keyword(k)
t = 'İlgili muhafaza'
t2 = unidecode(t)
t3 = normalize('NFD', t)
r = s.extract_keywords(t, span_info=True)
r2 = s.extract_keywords(t2, span_info=True)
r3 = s.extract_keywords(t3, span_info=True)
(
t, # ('İlgili muhafaza',
len(t), # 15,
r, # [('afa', 11, 14)],
t[r[0][1]:r[0][2]], # 'faz',
re.search(k, t), # <re.Match object; span=(10, 13), match='afa'>,
t2, # 'Ilgili muhafaza',
len(t2), # 15,
r2, # [('afa', 10, 13)],
t2[r2[0][1]:r2[0][2]], # 'afa',
t3, # 'İlgili muhafaza',
len(t3), # 16,
r3, # [('afa', 11, 14)],
t3[r3[0][1]:r3[0][2]], # 'afa')
)
The expected behaviour is that the span start and end match those returned by re, without having to normalise the string first. The issue is especially annoying when the returned start or end is greater than len(string).
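A likely cause, assuming the default case_sensitive=False makes flashtext lowercase the input before matching: 'İ' (U+0130) lowercases to two code points ('i' plus U+0307), so offsets computed on the lowercased text are shifted by one relative to the original string. The sketch below is a possible workaround on the caller's side, not part of flashtext. The helper names lowered_to_original_offsets and remap_spans are hypothetical, and the span passed in is the one reported above rather than a live call to the library.

```python
def lowered_to_original_offsets(text):
    """Map every index of text.lower() back to an index of text.

    Assumes lowercasing each character on its own yields the same
    length as lowercasing the whole string.
    """
    mapping = []
    for i, ch in enumerate(text):
        # A character that expands when lowercased owns several slots.
        mapping.extend([i] * len(ch.lower()))
    mapping.append(len(text))  # lets an end offset at the very end map too
    return mapping


def remap_spans(text, spans):
    """Translate (keyword, start, end) spans measured on text.lower()."""
    mapping = lowered_to_original_offsets(text)
    return [(kw, mapping[start], mapping[end]) for kw, start, end in spans]


t = '\u0130lgili muhafaza'      # 'İlgili muhafaza' with precomposed U+0130
reported = [('afa', 11, 14)]    # span reported above for t

print(len(t), len(t.lower()))   # 15 16
fixed = remap_spans(t, reported)
print(fixed)                    # [('afa', 10, 13)]
print(t[fixed[0][1]:fixed[0][2]])  # afa
```

The remapped span agrees with the re.search result above. A cleaner fix would be inside the library itself, by tracking offsets against the original text instead of the lowercased copy.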