joni
Multiline Option with ^ and $ anchors
Hi,
I am struggling to configure the Option passed to the search method when using Syntax.ECMAScript. I would expect that with Option.DEFAULT / Option.NONE, a regex that uses the ^ and $ anchors and contains no explicit newline would fail to match input that contains a newline character. For example:
byte[] pattern = "^[a-z]{1,10}$".getBytes();
byte[] str = "a\nb".getBytes();
Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
Matcher matcher = regex.matcher(str);
int result = matcher.search(0, str.length, Option.DEFAULT);
should return -1 but currently returns 0. Even passing Option.SINGLELINE does not change it. What I did to make this work was to subtract Option.MULTILINE:
int result = matcher.search(0, str.length, -Option.MULTILINE);
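A side note on the workaround above: unary minus on a bit flag is not the same as clearing that flag. The sketch below uses a hypothetical single-bit constant FLAG as a stand-in (it does not assume the real value of Option.MULTILINE in Joni) to show that the negated value still has the flag's own bit set, along with every higher bit. Whether this explains the changed result from search is not established here; it only shows what the negated value looks like.

```java
public class FlagNegationDemo {
    // Hypothetical stand-in for a single-bit option flag; the actual value
    // of Option.MULTILINE in Joni is NOT assumed here.
    static final int FLAG = 1 << 2;

    // Clears FLAG from an option word using the usual mask idiom.
    static int clear(int options) {
        return options & ~FLAG;
    }

    // True when FLAG is set in the given option word.
    static boolean isSet(int options) {
        return (options & FLAG) != 0;
    }

    public static void main(String[] args) {
        // In two's complement, -FLAG keeps FLAG's bit and sets all higher bits.
        System.out.println(Integer.toBinaryString(-FLAG));
        System.out.println(isSet(-FLAG));       // true: the flag is still set
        System.out.println(isSet(clear(0xFF))); // false: the mask idiom clears it
    }
}
```

If the intent is to switch a single flag off, `options & ~FLAG` is the conventional form.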
I have tested this case with multiple online regex tools and with the JavaScript regex implementation in my browser, and this example always gives no match (as I expect). Only adding the multiline option gives a result similar to the one from the Joni library.
Setting the syntax to Java works as expected and gives a result similar to this snippet, which uses the built-in Java regex:
String pattern = "^[a-z]{1,10}$";
String str = "a\nb";
Pattern p = Pattern.compile(pattern);
java.util.regex.Matcher m = p.matcher(str);
boolean result = m.find();
Is the MULTILINE option the default for the library's ECMAScript syntax, and should it be? I was digging into ECMAScript and it looks like multiline = false is the default (the user has to explicitly pass the m flag).
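To make the expected behavior concrete, here is a minimal sketch that uses java.util.regex as a stand-in for the non-multiline default described above. It is not Joni code; the class name AnchorDemo and the helper firstMatch are made up for illustration. It runs the same pattern and input with and without the multiline flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorDemo {
    // Returns the start index of the first match, or -1 when nothing matches
    // (mirroring the -1 convention of the search call in this thread).
    static int firstMatch(String regex, String input, int flags) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String regex = "^[a-z]{1,10}$";
        String input = "a\nb";

        // No flags: ^ and $ anchor to the whole input, so there is no match.
        System.out.println(firstMatch(regex, input, 0)); // -1

        // MULTILINE: ^ and $ also anchor at line breaks, so "a" matches at 0.
        System.out.println(firstMatch(regex, input, Pattern.MULTILINE)); // 0
    }
}
```

The second call reproduces the 0 that Joni currently returns for Syntax.ECMAScript, which is consistent with the multiline behavior being active there.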
One more note: in this example
byte[] pattern = "^[a-z]{1,10}$".getBytes();
byte[] str = "ab\nab\n".getBytes();
Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
Matcher matcher = regex.matcher(str);
int result = matcher.search(0, str.length, -Option.MULTILINE);
the result is 3. I think this should also be -1 (not found).
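For comparison, the sketch below lists every match start for this second input using java.util.regex, again only as a stand-in and not as Joni code (MatchStartsDemo and matchStarts are hypothetical names). Without the multiline flag there are no matches; with it, matches start at 0 and 3, and 3 is the index reported above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchStartsDemo {
    // Collects the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input, int flags) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String regex = "^[a-z]{1,10}$";
        String input = "ab\nab\n";

        // No flags: "ab" at index 0 is followed by more input, so no match.
        System.out.println(matchStarts(regex, input, 0)); // []

        // MULTILINE: each line matches on its own.
        System.out.println(matchStarts(regex, input, Pattern.MULTILINE)); // [0, 3]
    }
}
```

One caveat when using Java as the reference: without flags, Java's $ also matches just before a final line terminator, which JavaScript's $ does not, so the two engines can disagree on inputs such as "ab\n" even though they agree on this one.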
I'm not familiar with the differences in the ECMAScript support in Joni but perhaps @lopex will have something more to say?
It might be worth us digging up some ECMAScript regex tests to verify whether this mode is working as it should.
What I found are the official test262 test cases for ECMAScript (ECMA-262), but I did not find them very useful.
Much more readable are the V8 tests (V8 is the JavaScript engine of Chrome; search for files named .*regexp.*js). For example, there are test cases for the multiline flag.
Hi, did you have a chance to look at this issue? I would like to bring this thread back.
Maybe the syntax settings just need fixing?
@kmalski @lopex and I realized the other week that the ECMA settings were made during the development of the now dead DynJS project and were not sourced from oniguruma. So it could very well be that Syntax for that mode is just not quite right. Not being JS devs we don't really know.
I have checked the oniguruma project and could not find a syntax for ECMA (I believe there is none). There are a lot of different options in this project; do you have any suggestions on the best approach to preparing a good config for ECMA?
@kmalski I can see it is marked OP2_OPTION_PERL and that when it sees '^' will set multi true and single false. Not completely sure on direction here but ECMA OR'ing with PERL gives a bunch of default option twiddling in Parser.parseEnclose (look for syntax.op2OptionPerl()).
@kmalski I think the long-term solution would be to remove OP2_OPTION_PERL from the ECMA Syntax, but this is more complicated since in Parser#parseEnclose we get a lot of behavior from it. As an intermediate step you can update case '^': to twiddle options by adding some logic for syntax.op3OptionECMAScript(). Notice it toggles 5 things. It appears 2 of those you do not want (e.g. your issue), but what about the other 3? I have no idea.
You could also just try removing OP2_OPTION_PERL and see if you can see anything break. I suspect yes but _RUBY does not set it and they have many similar features.