compromise non-english parentheses tokenization

non-english parentheses tokenization

Open kant01ne opened this issue 2 years ago • 2 comments

Noticed the lib is removing some symbols unexpectedly when running the json methods.

Ex:

The same issue happens with foo[bar] or foo{bar}.

Expected behaviour:

nlp('foo{bar} foo').json() =>
[Object {]()
  text: "foo(bar) foo"
  terms: [Array(2) []()
  0: [Object {]()text: "foo(bar)", tags: Array(2), pre: "", post: " "}
  1: [Object {]()text: "foo", tags: Array(2), pre: "", post: ""}
]

Feb 16 '22 13:02 kant01ne

hey @kant01ne - thanks for the good bug. yeah, the tokenizer attempts to put punctuation in the pre/post fields - but in this case, (if there is a matching bracket in the term text) i agree that it should retain the ending bracket. will add this on the list, for v14 cheers

Feb 17 '22 17:02 spencermountain

Thanks 🙏 @spencermountain

Feb 18 '22 13:02 kant01ne

compromise compromise copied to clipboard

non-english parentheses tokenization

compromise
compromise copied to clipboard