flashtext span_info on combined unicode character(s)

Open kkaiser opened this issue 5 years ago • 0 comments

Hello,

I encountered an issue with span_info=True when it is used on a string containing combined characters. As a demonstration, consider the following example:

import re

from flashtext import KeywordProcessor
from unicodedata import normalize
from unidecode import unidecode

s = KeywordProcessor()
s.set_non_word_boundaries('_')
k = 'afa'
s.add_keyword(k)

t = 'İlgili muhafaza'
t2 = unidecode(t)
t3 = normalize('NFD', t)
r = s.extract_keywords(t, span_info=True)
r2 = s.extract_keywords(t2, span_info=True)
r3 = s.extract_keywords(t3, span_info=True)

(
    t,                     # ('İlgili muhafaza',
    len(t),                # 15,
    r,                     # [('afa', 11, 14)],
    t[r[0][1]:r[0][2]],    # 'faz',
    re.search(k, t),       # <re.Match object; span=(10, 13), match='afa'>,
    t2,                    # 'Ilgili muhafaza',
    len(t2),               # 15,
    r2,                    # [('afa', 10, 13)],
    t2[r2[0][1]:r2[0][2]], # 'afa',
    t3,                    # 'İlgili muhafaza',
    len(t3),               # 16,
    r3,                    # [('afa', 11, 14)],
    t3[r3[0][1]:r3[0][2]], # 'afa')
)

The expected behaviour is that the span start and end match those returned by re, without having to normalise the string first. The issue is especially annoying when the returned start or end is greater than len(string).
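
A possible explanation, assuming flashtext lowercases the input when case_sensitive is False (the default): in Python, 'İ' (U+0130) lowercases to two code points, 'i' followed by the combining dot U+0307, so the lowered string is one character longer and every later offset shifts by one. The sketch below uses only the standard library to show the length change and to map a span on the lowered string back onto the original. The helper names lowered_to_original_offsets and remap_span are hypothetical and are not part of flashtext.

```python
def lowered_to_original_offsets(text):
    """Map each index of text.lower() to the index in text it came from.

    A final sentinel entry maps len(text.lower()) to len(text), so
    exclusive span ends can be translated too.
    """
    mapping = []
    for i, ch in enumerate(text):
        # 'İ' (U+0130) lowercases to two code points: 'i' + U+0307,
        # so it contributes two entries pointing at the same index.
        mapping.extend([i] * len(ch.lower()))
    mapping.append(len(text))
    return mapping


def remap_span(text, start, end):
    """Translate a (start, end) span on text.lower() back onto text."""
    mapping = lowered_to_original_offsets(text)
    return mapping[start], mapping[end]


t = '\u0130lgili muhafaza'          # 'İlgili muhafaza' with precomposed İ
print(len(t), len(t.lower()))       # 15 16
start, end = remap_span(t, 11, 14)  # the span reported above for t
print(start, end, t[start:end])     # 10 13 afa
```

If case-insensitive matching is not needed, constructing the processor with KeywordProcessor(case_sensitive=True) may avoid the shift as well, though I have not verified that here.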

May 16 '19 12:05 kkaiser
