flashtext span_info on combined unicode characters

Open kkaiser opened this issue 5 years ago • 7 comments

This fixes issue: #81

Lowercasing a sentence that contains certain Unicode characters, such as 'İ', changes the length of the sentence.

s = 'İ love Big Apple and Bay Area.'
len(s)  # 30
len(s.lower())  # 31
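
To make the length change concrete, here is a small sketch using only the standard library. It inspects the first character of the example sentence before and after lowercasing. The variable names are illustrative only and are not part of flashtext.

```python
import unicodedata

s = 'İ love Big Apple and Bay Area.'

first = s[0]             # the single code point U+0130
lowered = first.lower()  # lowercases to two code points

# Name of the original character
print(unicodedata.name(first))
# Names of the code points produced by lower()
print([unicodedata.name(c) for c in lowered])
# One character in, two characters out
print(len(first), len(lowered))
```

Every offset computed on `s.lower()` after that first character is therefore shifted by one relative to `s`.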

Keywords and the search sentence are now lowercased character by character, so the returned span_info refers to positions in the original sentence.

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('İ love')
s = 'İ love Big Apple and Bay Area.'
keywords_found = keyword_processor.extract_keywords(s, span_info=True)
keywords_found
# old: [('İ love', 0, 7), ('New York', 8, 17), ('Bay Area', 22, 30)]
# new: [('İ love', 0, 6), ('New York', 7, 16), ('Bay Area', 21, 29)]
for k in keywords_found:
    print(s[k[1]:k[2]])
# new: İ love
# old: İ love
# new: Big Apple
# old: ig Apple
# new: Bay Area
# old: ay Area.
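
The idea of lowercasing per character while keeping spans aligned with the original sentence can be sketched independently of flashtext. The helper `lower_with_offsets` below is hypothetical and is not the code in this change. It only illustrates one way to map a position in the lowercased text back to the original string, assuming a plain substring search stands in for the keyword match.

```python
def lower_with_offsets(text):
    """Lowercase text one character at a time.

    Returns (lowered, offsets), where offsets[i] is the index in `text`
    of the character that produced lowered[i].
    """
    lowered_chars = []
    offsets = []
    for index, char in enumerate(text):
        # char.lower() may yield more than one character (e.g. for 'İ')
        for low in char.lower():
            lowered_chars.append(low)
            offsets.append(index)
    return ''.join(lowered_chars), offsets


s = 'İ love Big Apple and Bay Area.'
lowered, offsets = lower_with_offsets(s)

# Find the keyword in the lowercased text (stand-in for the real matcher)
keyword = 'big apple'
start = lowered.find(keyword)
end = start + len(keyword)

# Translate the span back to indices in the original sentence
orig_start = offsets[start]
orig_end = offsets[end - 1] + 1
print(s[orig_start:orig_end])
```

Slicing the original sentence with the translated span gives back the matched text without the one-character shift seen in the old output.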

May 28 '19 10:05 kkaiser

Coverage increased (+0.02%) to 99.327% when pulling 40633c9e92bbba581c3a13c4ff03ddbae449d4ae on kkaiser:master into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.

May 28 '19 10:05 coveralls

Moving the check inside the loop adds a little execution overhead. Please share a test case for the change. Thanks