[bug] set of word boundary characters too restrictive

Open aseifert opened this issue 6 years ago • 1 comments

Hello there,

First of all, thanks for the amazing algorithm, it's really useful!

It turns out that non_word_boundaries contains only a very restricted set of characters, and any character outside that set is treated as a word boundary. For many languages this poses a problem. E.g. in German:

from flashtext import KeywordProcessor
kwp = KeywordProcessor()
kwp.add_keyword("lt.")
kwp.extract_keywords("Damit galt es als so gut wie fix, dass Vueling den Zuschlag erhält.")
# I would expect this to be empty
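
To make the cause concrete, here is a minimal sketch of the boundary rule. It assumes the default set is ASCII letters, digits, and the underscore, and it uses a plain substring scan in a hypothetical helper, find_whole_word, rather than flashtext's trie-based matching, so it models only the boundary check and not the library's implementation.

```python
import string

# Assumption: the default "inside a word" characters are ASCII-only.
DEFAULT_NON_WORD_BOUNDARIES = set(string.digits + string.ascii_letters + "_")


def find_whole_word(text, keyword, non_word_boundaries):
    """Return start offsets where keyword has a boundary on both sides.

    Hypothetical helper for illustration; not part of flashtext.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # A neighbour counts as a boundary if it is missing (string edge)
        # or is not listed in non_word_boundaries.
        before_ok = start == 0 or text[start - 1] not in non_word_boundaries
        after_ok = end == len(text) or text[end] not in non_word_boundaries
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(keyword, start + 1)
    return hits


sentence = "Damit galt es als so gut wie fix, dass Vueling den Zuschlag erhält."

# ASCII-only set: "ä" is not a word character, so "lt." looks like a whole word.
print(find_whole_word(sentence, "lt.", DEFAULT_NON_WORD_BOUNDARIES))

# Same check once the German letters are treated as word characters.
german = DEFAULT_NON_WORD_BOUNDARIES | set("ÖÄÜöäüß")
print(find_whole_word(sentence, "lt.", german))
```

In this model the first call reports one hit at the end of "erhält.", because the character before "lt." is "ä", which the ASCII-only set treats as a boundary. The second call reports none.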

The problem can be fixed (for German) by adjusting the property non_word_boundaries:

kwp.non_word_boundaries = kwp.non_word_boundaries.union(list("ÖÄÜöäüß"))
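
One possible way to generalize this beyond German, sketched under assumptions: build the set from Python's own Unicode classification instead of listing letters by hand. The helper word_chars and the chosen code point range (up to Latin Extended-B) are illustrative choices, not anything flashtext provides. The result would be assigned to the non_word_boundaries property in the same way as above.

```python
import string

# Assumption: the library's defaults are ASCII letters, digits, and "_".
ascii_defaults = set(string.digits + string.ascii_letters + "_")


def word_chars(start=0, stop=0x250):
    """Collect code points in [start, stop) that str.isalnum() accepts.

    Hypothetical helper. The default range covers Basic Latin, Latin-1
    Supplement, and Latin Extended-A/B. Note that isalnum() also accepts
    characters such as superscript digits and vulgar fractions.
    """
    return {chr(cp) for cp in range(start, stop) if chr(cp).isalnum()}


extended = ascii_defaults | word_chars()

# With flashtext installed, the set could then be applied like this:
# kwp.non_word_boundaries = extended
print(len(ascii_defaults), len(extended))
```

A wider range (or several ranges) could be passed for other scripts. The trade-off is a larger set and a broader notion of "letter" than some use cases want.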

Would you consider internationalizing the word boundaries or is this restrictive behavior on purpose?

Thanks, Alex

aseifert • Mar 19 '18 15:03

Hi Alex,

I only know English, so I couldn't make it work for other languages: I wouldn't be able to understand or test the behavior.

> Would you consider internationalizing the word boundaries or is this restrictive behavior on purpose?

I would consider it, but I don't know how. Feel free to make whatever changes make sense to you.

Please send a pull request with test cases if possible. I would really appreciate that :)

Thanks, Vikash

vi3k6i5 • Mar 19 '18 16:03