CoreNLP
Tokensregex error with operator "+" (plus)
Hi, I just encountered this error while trying the tokensregex syntax at http://corenlp.run/
- Version 4.4.0
- Example of a working pattern:
the very* first? day of the tentacle
- Example of a failing pattern:
the very* first? day+ of the tentacle
It seems that the + character is escaped as \+ at some point in the process (see the error screenshot).
If I try the pattern the very* first? day{1,} of the tentacle, it works as expected.
I also tried to parse the same pattern with the CoreNLP Java library in version 4.4.0, and it works without error with the "+" operator.
String strPattern = "the very* first? day+ of the tentacle";
TokenSequenceParser parser = new TokenSequenceParser();
Env env = new Env(parser);
env.initDefaultBindings();
Pair<PatternExpr, SequenceMatchAction<CoreMap>> p = parser.parseSequenceWithAction(env, strPattern);
// => works without error !
I don't know whether the problem is only present on the http://corenlp.run/ online tester, or also in a Java lib that I haven't tried.
I've tried to avoid learning anything about JavaScript when I can help it, but in the server .js file, this looks incorrect to me:
url: serverAddress + '/tokensregex?pattern=' + encodeURIComponent(
pattern.replace("&", "\\&").replace('+', '\\+')) +
I would think that the whole point of encoding the pattern with encodeURIComponent is to escape all special characters, so escaping + and & separately beforehand shouldn't be necessary. At any rate, the server doesn't unescape anything a second time that I can see, so the patterns would be interpreted with a literal \ in them and not function correctly.
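To make the point concrete, here is a small sketch comparing the two encodings of the failing pattern. It assumes the server performs a single standard URL decode of the query parameter; `decodeURIComponent` stands in for that step here and is not the actual server code.

```javascript
// Failing pattern from the report.
const pattern = "the very* first? day+ of the tentacle";

// What the quoted client code does: backslash-escape, then percent-encode.
const doubleEscaped = encodeURIComponent(
  pattern.replace("&", "\\&").replace("+", "\\+"));

// Percent-encoding alone, with no manual escaping.
const encodedOnce = encodeURIComponent(pattern);

// Assumption: the server decodes the parameter exactly once.
const seenWithEscaping = decodeURIComponent(doubleEscaped);
const seenWithoutEscaping = decodeURIComponent(encodedOnce);

console.log(doubleEscaped);       // contains "day%5C%2B"
console.log(seenWithEscaping);    // the very* first? day\+ of the tentacle
console.log(seenWithoutEscaping); // the very* first? day+ of the tentacle
```

Under that assumption, the extra escaping delivers `day\+` to the pattern parser, which would explain the error on the online tester, while plain `encodeURIComponent` round-trips the pattern unchanged.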
https://github.com/stanfordnlp/CoreNLP/commit/8413fa1fc432aa2a13cbb4a296352bb9bad4d0cb
On Thu, Mar 3, 2022 at 2:55 AM PERANI Julien wrote the report quoted above, with these screenshots:
- Working pattern: https://user-images.githubusercontent.com/4158840/156548894-b3202305-c95d-42b6-ba48-8aadcecf557b.png
- Failing pattern: https://user-images.githubusercontent.com/4158840/156549050-5ec6ab19-9f67-4d48-a5c0-9687e5590ab8.png