flashtext
internationalize word boundary checks
Hi there,
I think the only safe way to deal with issue #48 would be to test against the \W
class [1]. Judging from the benchmarks linked at https://github.com/vi3k6i5/flashtext#why-not-regex, this would make matching slower by a factor of 1-2, though.
Best, Alex
[1] Quoting the Python docs:
\b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
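For comparison (my addition, not part of the original comment): Python 3's `re` module treats `\w` and `\b` as Unicode-aware by default, so a regex-based check already handles the Cyrillic case from issue #48 that flashtext's ASCII-only default boundaries miss:

```python
import re

# \b is Unicode-aware in Python 3, so the Cyrillic keyword "рок" is
# matched only as a standalone word, not inside "порок" or "роковой".
text = 'рок порок роковой'
matches = re.findall(r'\bрок\b', text)
print(matches)  # ['рок']
```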
Coverage increased (+0.7%) to 100.0% when pulling 9b6b187b2b67ad279092d3f36f3dd4d64b8994a9 on aseifert:master into 5591859aabe3da37499a20d0d0d6dd77e480ed8d on vi3k6i5:master.
Coverage increased (+0.7%) to 100.0% when pulling 9b6b187b2b67ad279092d3f36f3dd4d64b8994a9 on aseifert:master into 5591859aabe3da37499a20d0d0d6dd77e480ed8d on vi3k6i5:master.
Another way, based on https://stackoverflow.com/a/2998550:

```python
import unicodedata

def is_word_char(c, _categories=frozenset({'Ll', 'Lu', 'Lt', 'Lo', 'Lm', 'Nd', 'Pc'})):
    # Word characters: letters (Ll/Lu/Lt/Lo/Lm), decimal digits (Nd),
    # and connector punctuation such as '_' (Pc).
    return unicodedata.category(c) in _categories
```
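As a quick sanity check (my addition, repeating the helper above so the snippet runs on its own), the category test classifies Cyrillic letters and digits as word characters while ordinary punctuation falls through:

```python
import unicodedata

def is_word_char(c, _categories=frozenset({'Ll', 'Lu', 'Lt', 'Lo', 'Lm', 'Nd', 'Pc'})):
    return unicodedata.category(c) in _categories

print(is_word_char('р'))  # True: Cyrillic lowercase letter, category 'Ll'
print(is_word_char('7'))  # True: decimal digit, category 'Nd'
print(is_word_char('_'))  # True: connector punctuation, category 'Pc'
print(is_word_char('.'))  # False: category 'Po' is not in the set
```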
Another way to do it:

```python
from functools import lru_cache

from flashtext import KeywordProcessor


class NonWordBoundaries:
    """Duck-types as a set of non-word-boundary characters via __contains__."""

    def __init__(self, *predicates):
        self.predicates = predicates

    @lru_cache(maxsize=128)
    def __contains__(self, ch):
        return any(predicate(ch) for predicate in self.predicates)


def main():
    words_to_search = ["рок"]
    keyword_processor = KeywordProcessor()
    keyword_processor.set_non_word_boundaries(NonWordBoundaries(str.isalpha, str.isdigit))
    keyword_processor.add_keywords_from_list(words_to_search)
    keywords_found = keyword_processor.extract_keywords('рок порок роковой')
    print(keywords_found)


if __name__ == '__main__':
    main()
```
Not sure about performance, though. But at least the behaviour is easy to modify.
The benchmarks vs. regex only cover the English character set. Does widening the word boundaries like this affect flashtext's performance in any significant way?
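To get a rough number (my own micro-benchmark sketch, not from this thread), one can time plain set membership — roughly what flashtext's default `non_word_boundaries` set costs — against the predicate-based check, which pays a Python-level function call per character:

```python
import string
import timeit

# Baseline: a plain set of ASCII word characters, similar to
# flashtext's default non-word-boundary set.
default_boundaries = set(string.digits + string.ascii_letters + '._')

def predicate_check(ch, predicates=(str.isalpha, str.isdigit)):
    # Predicate-based membership, as in the NonWordBoundaries idea above.
    return any(p(ch) for p in predicates)

sample = 'рок порок роковой abc 123 ...'
t_set = timeit.timeit(lambda: [c in default_boundaries for c in sample], number=10_000)
t_pred = timeit.timeit(lambda: [predicate_check(c) for c in sample], number=10_000)
print(f'set membership: {t_set:.3f}s, predicate check: {t_pred:.3f}s')
```

On CPython the predicate path is typically a few times slower per character, but whether that matters end to end depends on how much of flashtext's runtime is boundary checking versus trie traversal.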