flashtext
KeywordProcessor returns wrong span for text containing non-ASCII characters when case_sensitive=False
Hi all, first thanks a lot for the great library you created, I really appreciate it!
When working with non-ASCII characters, I found a case where the span returned by the KeywordProcessor is wrong when case_sensitive=False.
Please find a sample below that reproduces the error:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keyword('Bay Area')
text = 'İ I love big Apple and Bay Area.' # added the "İ" non-ascii character
keywords_found = keyword_processor.extract_keywords(text, span_info=True)
for match in keywords_found:
    print(match)
    print(text[match[1]:match[2]])
Output:
('Bay Area', 24, 32)
ay Area. # the span is shifted by one
When looking into the error, I figured out that the length of "İ" changes from 1 (uppercase) to 2 (lowercase). I believe this causes the span shift, because the span is only wrong when matching is not case-sensitive.
In [39]: len("İ")
Out[39]: 1
In [40]: len("İ".lower())
Out[40]: 2
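A possible workaround outside the library is to keep track of which original character each lowercased character came from, and then map a shifted span back. The sketch below is only an illustration of that idea under a few assumptions: it uses plain `str.find` in place of flashtext, `lower_with_offsets` is a hypothetical helper (not part of flashtext), and it lowercases one character at a time, which may differ from `text.lower()` in context-dependent cases such as the Greek final sigma.

```python
def lower_with_offsets(text):
    """Lowercase text one character at a time and record, for every index
    of the lowercased string, the index of the original character."""
    pieces = []
    origin = []
    for index, char in enumerate(text):
        lowered_char = char.lower()
        pieces.append(lowered_char)
        # A character that expands (e.g. "İ" -> "i" + U+0307) contributes
        # several lowercased positions that all point at the same original index.
        origin.extend([index] * len(lowered_char))
    origin.append(len(text))  # lets an exclusive end offset be mapped too
    return "".join(pieces), origin


text = 'İ I love big Apple and Bay Area.'
lowered, origin = lower_with_offsets(text)

start = lowered.find('bay area')         # offset in the lowercased text
end = start + len('bay area')
print((start, end))                      # span relative to the lowercased text
print(text[origin[start]:origin[end]])   # span mapped back to the original text
```

With the sample text this prints (24, 32) followed by Bay Area, so the shifted span from the report maps back to the expected (23, 31) in the original string.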
Could any of the authors comment on the issue and say whether they intend to do something about it, or whether it is out of scope?
Thanks a lot!
Hey Mauro, it doesn't look like the repo is being actively maintained these days. As a pet project, I am going through the codebase to give it a revamp. Given that this issue is not especially common, with non-ASCII characters or otherwise, what I've done to address it amounts to the following:
- inserting some thoughtfully placed if statements to catch instances where the lengths differ after lowercasing, and raising a ValueError in such cases.
- ensuring appropriate text normalisation before the text is passed as an argument to functions that make use of lowercasing.
In such instances, the onus is usually on the user to make sure the text is normalised. This is fundamentally a text cleanliness issue rather than an issue with calculating the spans, which so far looks to be behaving as it should in this case. If the length of the string changes part way through, I would consider raising an error to be sensible, since it stops an incorrect span from being returned.
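To make the two points above concrete, here is a minimal sketch of what such a guard and normalisation step could look like. It is not the code from the revamp: `lower_or_raise` and `normalise` are hypothetical names, the choice of NFD normalisation is an assumption, and `str.find` stands in for the keyword lookup. NFD decomposes "İ" into "I" followed by the combining dot U+0307, so lowercasing no longer changes the length.

```python
import unicodedata


def lower_or_raise(text: str) -> str:
    """Lowercase text, refusing input whose length changes when lowercased."""
    lowered = text.lower()
    if len(lowered) != len(text):
        offenders = sorted({ch for ch in text if len(ch.lower()) != 1})
        raise ValueError(
            f"lowercasing changes the text length; offending characters: {offenders!r}"
        )
    return lowered


def normalise(text: str) -> str:
    """Decompose characters (NFD) so that 'İ' becomes 'I' + U+0307."""
    return unicodedata.normalize("NFD", text)


text = "İ I love big Apple and Bay Area."

try:
    lower_or_raise(text)           # raw input is rejected
except ValueError as err:
    print(err)

clean = normalise(text)            # the user normalises first
lowered = lower_or_raise(clean)    # now the lengths agree
start = lowered.find("bay area")
print(clean[start:start + len("bay area")])
```

Note that after normalisation the spans refer to the normalised string (`clean` here), not to the original input, so any slicing has to be done on the normalised text as well.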