Multiline Option with ^ and $ anchors

Open kmalski opened this issue 3 years ago • 10 comments

Hi,

I am struggling to configure the Option passed to the search method when using Syntax.ECMAScript. I would expect that, with Option.DEFAULT or Option.NONE, a regex that uses the ^ and $ anchors and contains no explicit newline would fail to match input containing a newline character. For example:

byte[] pattern = "^[a-z]{1,10}$".getBytes();
byte[] str = "a\nb".getBytes();

Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
Matcher matcher = regex.matcher(str);
int result = matcher.search(0, str.length, Option.DEFAULT);

should return -1 but currently returns 0. Even passing Option.SINGLELINE does not change this. What I did to make this work was to pass Option.MULTILINE with a minus sign:

int result = matcher.search(0, str.length, -Option.MULTILINE);

I have tested this case with multiple online regex tools and with the JavaScript regex implementation in my browser, and this example never matches (as I expect). Only adding the multiline option gives a result similar to the one from the Joni library.

Setting the syntax to Java works as expected and gives a result similar to this snippet, which uses the built-in Java regex:

String pattern = "^[a-z]{1,10}$";
String str = "a\nb";

Pattern p = Pattern.compile(pattern);
java.util.regex.Matcher m = p.matcher(str);
boolean result = m.find();

Is the MULTILINE option the default for the library's ECMAScript syntax, and should it be? I dug into ECMAScript, and it looks like multiline = false is the default (the user has to pass the m flag explicitly).
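To make the expected baseline concrete, here is a minimal sketch using only java.util.regex (not Joni). It runs the same pattern and input with and without Pattern.MULTILINE. The class name AnchorDemo and the helper found are hypothetical names introduced for illustration.

```java
import java.util.regex.Pattern;

final class AnchorDemo {
    // Returns true if the regex is found anywhere in the input under the given flags.
    static boolean found(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "^[a-z]{1,10}$";
        String input = "a\nb";

        // Default flags: ^ matches only at the start of the input, and the "\n"
        // after "a" is not the final line terminator, so $ fails there.
        System.out.println(found(regex, input, 0));                 // false

        // MULTILINE: ^ and $ also match just after and just before line
        // terminators, so the first line "a" matches on its own.
        System.out.println(found(regex, input, Pattern.MULTILINE)); // true
    }
}
```

This only shows what the built-in Java engine does. Whether Joni's ECMAScript syntax should behave the same way without the multiline option is the open question of this issue.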

kmalski avatar Jan 12 '22 12:01 kmalski

One more note. In this example:

        byte[] pattern = "^[a-z]{1,10}$".getBytes();
        byte[] str = "ab\nab\n".getBytes();

        Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, -Option.MULTILINE);

the result is 3. I think this should also be -1 (not found).
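Both snippets pass -Option.MULTILINE. As a side note on the arithmetic only: if the option constants are single-bit flags, unary minus does not clear the bit. In two's complement it keeps that bit set and also sets every higher bit. The sketch below uses a hypothetical constant FLAG = 1 << 2 as a stand-in, since the actual value of Option.MULTILINE is not stated in this thread, and the class name FlagBits is likewise made up for illustration.

```java
final class FlagBits {
    // Hypothetical single-bit flag standing in for a constant such as Option.MULTILINE.
    static final int FLAG = 1 << 2;

    // Clears FLAG from an option word; the usual way to switch a bit flag off.
    static int clear(int options) {
        return options & ~FLAG;
    }

    public static void main(String[] args) {
        int negated = -FLAG;

        // Bit 2 and every higher bit are set; bits 0 and 1 are clear.
        System.out.println(Integer.toBinaryString(negated));

        // The flag bit itself is still set after negation.
        System.out.println((negated & FLAG) != 0);     // true

        // Masking with ~FLAG is what actually removes the flag.
        System.out.println((clear(0xFF) & FLAG) != 0); // false
    }
}
```

Whether those extra high bits explain the results of 0 and 3 depends on how Joni interprets search-time options, which this thread does not establish.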

kmalski avatar Jan 13 '22 10:01 kmalski

I'm not familiar with the differences in the ECMAScript support in Joni but perhaps @lopex will have something more to say?

It might be worth us digging up some ECMAScript regex tests to verify whether this mode is working as it should.

headius avatar Jan 18 '22 15:01 headius

I found the official test cases for ECMA-262 (test262), but I did not find them very useful.

Much more readable are the V8 tests (V8 is the JavaScript engine of Chrome; search for files named .*regexp.*js). For example, there are test cases for the multiline flag.

kmalski avatar Jan 19 '22 19:01 kmalski

Hi, did you have a chance to look at this issue? I would like to bring this thread back.

kmalski avatar May 22 '23 21:05 kmalski

Maybe the syntax settings just need fixing?

lopex avatar May 22 '23 22:05 lopex

@kmalski, @lopex and I realized the other week that the ECMA settings were made during the development of the now-dead DynJS project and were not sourced from oniguruma. So it could very well be that the Syntax for that mode is just not quite right. Not being JS devs, we don't really know.

enebo avatar May 23 '23 14:05 enebo

I have checked the oniguruma project and could not find a syntax for ECMA (I believe there is none). The project has a lot of different options; do you have any suggestions on the best approach to preparing a good config for ECMA?

kmalski avatar May 23 '23 19:05 kmalski

@kmalski I can see it is marked OP2_OPTION_PERL and that when it sees '^' it will set multi to true and single to false. I am not completely sure of the direction here, but ECMA OR'ing with PERL gives a bunch of default option twiddling in Parser.parseEnclose (look for syntax.op2OptionPerl()).

enebo avatar May 23 '23 20:05 enebo

@kmalski I think the long-term solution would be to remove OP2_OPTION_PERL from the ECMA Syntax, but this is more complicated since in Parser#parseEnclose we get a lot of behavior from it. As an intermediate step you can update case '^': to twiddle options by adding some logic for syntax.op3OptionECMAScript(). Notice it toggles 5 things. It appears 2 of those you do not want (e.g. your issue), but what about the other 3? I have no idea.

enebo avatar May 23 '23 20:05 enebo

You could also just try removing OP2_OPTION_PERL and see if you can see anything break. I suspect yes but _RUBY does not set it and they have many similar features.

enebo avatar May 23 '23 20:05 enebo