compromise
compromise copied to clipboard
non-english parentheses tokenization
Noticed the lib is removing some symbols unexpectedly when running the json
methods.
Ex:
The same issue happens with foo[bar]
or foo{bar}
.
Expected behaviour:
nlp('foo{bar} foo').json() =>
[Object {]()
text: "foo(bar) foo"
terms: [Array(2) []()
0: [Object {]()text: "foo(bar)", tags: Array(2), pre: "", post: " "}
1: [Object {]()text: "foo", tags: Array(2), pre: "", post: ""}
]
hey @kant01ne - thanks for the good bug. yeah, the tokenizer attempts to put punctuation in the pre/post fields - but in this case, (if there is a matching bracket in the term text) i agree that it should retain the ending bracket. will add this on the list, for v14 cheers
Thanks 🙏 @spencermountain