span_info on combined unicode characters

Open kkaiser opened this issue 5 years ago • 7 comments

This fixes issue: #81

Lowercasing a sentence can change its length: some characters, such as 'İ' (U+0130), lowercase to a base letter plus a combining mark, so one code point becomes two.

s = 'İ love Big Apple and Bay Area.'
len(s)  # 30
len(s.lower())  # 31
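
To see where the extra character comes from, the lowercased 'İ' can be inspected code point by code point. This is only an illustration using the Python standard library; the variable names are made up for the example.

```python
import unicodedata

s = '\u0130 love Big Apple and Bay Area.'  # '\u0130' is 'İ'
lowered = s.lower()

# 'İ' lowercases to two code points: 'i' followed by a combining dot above.
first_char_lowered = '\u0130'.lower()
codepoints = [hex(ord(c)) for c in first_char_lowered]
names = [unicodedata.name(c) for c in first_char_lowered]

print(len(s), len(lowered))  # 30 31
print(codepoints)            # ['0x69', '0x307']
print(names)                 # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```

Every offset computed on the lowercased string after that first character is therefore shifted by one relative to the original sentence.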

Keywords and the search sentence are now lowercased on a per-character basis, so the offsets in span_info line up with the original sentence:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('İ love')
s = 'İ love Big Apple and Bay Area.'
keywords_found = keyword_processor.extract_keywords(s, span_info=True)
keywords_found
# old: [('İ love', 0, 7), ('New York', 8, 17), ('Bay Area', 22, 30)]
# new: [('İ love', 0, 6), ('New York', 7, 16), ('Bay Area', 21, 29)]
for k in keywords_found:
    print(s[k[1]:k[2]])
# new: İ love
# old: İ love
# new: Big Apple
# old: ig Apple
# new: Bay Area
# old: ay Area.
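
A minimal sketch of the per-character idea, not the actual code in this PR: `lower_preserving_length` is a hypothetical helper that lowercases one character at a time and keeps the original character whenever its lowercase form would be longer than one code point. The result has the same length as the input, so offsets found in it are valid for the original sentence.

```python
def lower_preserving_length(text):
    """Lowercase text one character at a time without changing its length.

    Hypothetical helper for illustration; not part of flashtext.
    """
    out = []
    for ch in text:
        low = ch.lower()
        # Keep the original character if lowercasing would expand it
        # (e.g. 'İ' becomes 'i' plus a combining dot above).
        out.append(low if len(low) == 1 else ch)
    return ''.join(out)


s = '\u0130 love Big Apple and Bay Area.'  # '\u0130' is 'İ'
lowered = lower_preserving_length(s)

# Offsets found in the lowered text map directly onto the original.
start = lowered.index('big apple')
end = start + len('big apple')
print(len(s) == len(lowered))  # True
print(s[start:end])            # Big Apple
```

For matching to work, keywords would have to be lowercased with the same helper, so that a keyword such as 'İ love' is stored in the same form as it appears in the lowered sentence.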

kkaiser avatar May 28 '19 10:05 kkaiser

Coverage Status

Coverage increased (+0.02%) to 99.327% when pulling 40633c9e92bbba581c3a13c4ff03ddbae449d4ae on kkaiser:master into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.

coveralls avatar May 28 '19 10:05 coveralls

Moving the check inside the loop adds a little execution overhead. Please share a test case for the change. Thanks.

vi3k6i5 avatar May 03 '20 07:05 vi3k6i5

Can you please resolve the conflict?

vi3k6i5 avatar May 03 '20 07:05 vi3k6i5

Ready for review

kkaiser avatar May 15 '20 09:05 kkaiser

This is still happening with flashtext-2.7. Looks like the fix was never merged into master...

laurenegerton avatar Nov 16 '20 14:11 laurenegerton

@vi3k6i5 - Hi there! Is this ready, and can you complete merging this PR?

spencertollefson avatar Apr 13 '21 20:04 spencertollefson