flashtext
flashtext copied to clipboard
span_info on combined unicode characters
This fixes issue: #81
Lowering a sentence with combined unicode chars changes the length of a sentence.
s = 'İ love Big Apple and Bay Area.'
len(s) # 30
len(s.lower()) # 31
Lowering keywords and search sentence now works on a per char basis to return the correct span_info
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('İ love')
s = 'İ love Big Apple and Bay Area.'
keywords_found = keyword_processor.extract_keywords(s, span_info=True)
keywords_found
# old: [('İ love', 0, 7), ('New York', 8, 17), ('Bay Area', 22, 30)]
# new: [('İ love', 0, 6), ('New York', 7, 16), ('Bay Area', 21, 29)]
for k in keywords_found:
print(s[k[1]:k[2]])
# new: İ love
# old: İ love
# new: Big Apple
# old: ig Apple
# new: Bay Area
# old: ay Area.
Coverage increased (+0.02%) to 99.327% when pulling 40633c9e92bbba581c3a13c4ff03ddbae449d4ae on kkaiser:master into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.
Coverage increased (+0.02%) to 99.327% when pulling 40633c9e92bbba581c3a13c4ff03ddbae449d4ae on kkaiser:master into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.
Moving the check inside the loop ads a little of execution overhead, please share a test case for the change. Thanks
Can you please resolve the conflict.
Ready for review
This is still happening with flashtext-2.7 Looks like the fix was never merged with master...
@vi3k6i5 - Hi there! Is this ready for and can you complete merging this PR?